arXiv: 1508.00249v2 [math.ST] 11 Jan 2016 


LEPSKI’S METHOD AND ADAPTIVE ESTIMATION OF 
NONLINEAR INTEGRAL FUNCTIONALS OF DENSITY 


By Rajarshi Mukherjee, Eric Tchetgen Tchetgen and James Robins 

We study the adaptive minimax estimation of non-linear integral 
functionals of a density and extend the results obtained for linear and 
quadratic functionals to general functionals. The typical rate optimal 
non-adaptive minimax estimators of “smooth” non-linear functionals 
are higher order U-statistics. Since Lepski’s method requires tight 
control of tails of such estimators, we bypass such calculations by a 
modification of Lepski’s method which is applicable in such situations. 
As a necessary ingredient, we also provide a method to control 
higher order moments of minimax estimators of cubic integral functionals. 
Following a standard constrained risk inequality method, we 
also show the optimality of our adaptation rates. 


Introduction. Estimation of statistical functionals in nonparametric 
problems has received considerable attention over the last few decades. Of 
specific interest have been both linear and non-linear integral functionals of 
an underlying density. For example, a large body of work has focused on 
estimating the entropy of an underlying distribution. Beirlant et al. (1997) 
provide an overview of results and related techniques. More recent works 
include estimation of Rényi and Tsallis entropies (Leonenko and Seleznjev, 
2010; Pal, Poczos and Szepesvari, 2010). For more references and examples 
one can refer to Kandasamy et al. (2014). We consider a framework 
for estimating such integral functionals of a density. In particular, suppose 
$X_1, \ldots, X_n$ are i.i.d. on $[0,1]$ with density $f(x)$ with respect to Lebesgue 
measure. We take $f \in H(\beta, C)$ where $H(\beta, C)$ is a Hölder ball of 
smoothness $\beta$ and radius $C > 0$. We are interested in estimation of $\phi(f)$ where 
$\phi : \mathcal{F} \to \mathbb{R}$ is a non-linear functional of density and $\mathcal{F}$ refers to the class 
of all densities on $[0,1]$. It is well known (Birge and Massart, 1995) that if 
$\phi(f) = \int T(f(x)) d\mu(x)$, and $T$ is sufficiently smooth, then the minimax rate 
of estimation over $f \in H(\beta, C)$ in squared error norm is $n^{-\frac{8\beta}{1+4\beta}}$ when 
$\beta < \frac{1}{4}$ and $n^{-1}$ for $\beta > \frac{1}{4}$. 

There exists an extensive literature addressing minimax estimation of linear 
and quadratic functionals in density, white noise or nonparametric regression 
models. Although definitely not exhaustive, a representative snapshot 
of this immense body of work can be found in Bickel and Ritov (1988); 
Cai and Low (2003, 2004, 2005); Donoho, Liu and MacGibbon (1990); 
Donoho and Nussbaum (1990); Fan (1991); Hall and Marron (1987); 
Kerkyacharian et al. (1996); Laurent et al. (1996) and other references 
therein. Most estimators proposed in the literature above, which attain the 
minimax rate of convergence over certain smoothness classes of the underlying 
function, depend explicitly on the knowledge of the smoothness index of the 
class. In particular, a standard technique in estimation of these functionals 
is to expand the infinite dimensional function of interest in a suitable 
orthonormal basis of $L_2[0,1]$ and to estimate an approximate functional 
created by truncating the basis expansion at a certain point. The point of 
truncation decides the approximation error of the truncated functional as a 
surrogate for the actual functional and depends on the smoothness of the 
function of interest and the approximation properties of the orthonormal basis 
used. This point of truncation is then chosen to delicately balance the bias 
and variance of the resulting estimator and therefore directly depends on the 
smoothness of the function. Thus, it becomes of interest to understand the 
question of adaptive estimation, i.e. the construction and analysis of 
estimators without prior knowledge of the smoothness. 

The question of adaptation for linear and quadratic functionals has been 
studied in detail as well (Cai et al., 2005a,b; Cai and Low, 2006; Cai et al., 
2006, 2008; Efromovich and Low, 1994; Efromovich et al., 1996b; 
Efromovich and Samarov, 2000; Gine and Nickl, 2008; Klemela and Tsybakov, 
2001; Laurent and Massart, 2000; Low, 1992, and references therein). However, 
adaptive estimation of general non-linear functionals has not been addressed 
in complete generality. In this paper we address the problem of adaptive 
estimation of general smooth non-linear functionals of a density. 

Robins et al. (2008) have developed a concrete theory for addressing minimax 
estimation of a class of non-linear functionals in certain non-parametric 
and semi-parametric problems under low regularity smoothness conditions. 
Following a similar logic of construction, Tchetgen et al. (2008) construct a 
minimax estimator of $\int f^3 d\mu$ for $\beta < \frac{1}{4}$. The specific minimax estimator is a 
third order U-statistic and the construction of the kernel depends explicitly 
on the knowledge of the underlying smoothness. Based on the technique of 
Birge and Massart (1995) one can show that the minimax estimator of a general 
non-linear functional $\phi(f) = \int T(f) d\mu$ for smooth $T$ can be constructed 
by using ideas from linear, quadratic and cubic functionals of the density 
and appealing to a standard Taylor expansion argument. A similar strategy can 
be followed when producing adaptive estimators of non-linear functionals, and 
it therefore becomes crucial to understand the adaptive estimation of 
$\int f^3 d\mu$. 

In our search for answers regarding $\int f^3 d\mu$, we start by looking at 
the general idea driving estimation of quadratic functionals of the density, 
i.e. $\phi(f) = \int f^2 d\mu$. Gine and Nickl (2008) provide adaptive estimators 
of the quadratic functional $\phi(f)$ where the estimators are based on certain 
types of second order U-statistics with specific kernels. Our results provide 
a richer class of estimators based on a compactly supported wavelet basis 
and are in line with the theory established by Efromovich et al. (1996b); 
Gine and Nickl (2008). 

Unlike for a quadratic functional, the minimax estimator for $\phi(f) = \int f^r d\mu$ 
is in general an $r$th order U-statistic. In order to apply the regular Lepski’s 
Method we will need to obtain suitable exponential deviation inequalities 
for higher order U-statistics. Although such moment inequalities do exist 
(Adamczak et al., 2006), the bounds include complicated quantities which 
need to be controlled in a problem specific manner. To bypass such 
complications, we employ a modification of Lepski’s method where we test for 
the smoothness using our previously obtained second order U-statistics and 
use the selected smoothness to estimate the required functional. 

In this paper we focus on adaptation over the non-$\sqrt{n}$ regime, i.e. $0 < \beta < \frac{1}{4}$, 
although our proofs carry over easily to the $\sqrt{n}$ range, i.e. $\beta > \frac{1}{4}$, as well, where 
adaptation is possible without paying a price and it is possible to achieve 
asymptotic efficiency for $\beta > \frac{1}{4}$. Moreover, the case of higher dimensions 
of $X$, i.e. $d > 1$, can also be handled by arguments similar to those in this 
paper. A brief discussion of these possible extensions is given in Section 8. 

The main contributions of this paper are as follows. This work extends 
previous results obtained for linear and quadratic functionals to a new 
adaptation theory for non-linear integral functionals of the density. In order 
to do so, we develop a suitable variant of Lepski’s method which bypasses 
establishing exponential tail bounds for estimators of general non-linear 
integral functionals. In applying this modified Lepski’s method, the main 
challenge lies in obtaining suitable bounds on higher order moments of suitable 
estimators of general non-linear integral functionals. This requires control of 
moments of higher order U-statistics based on orthogonal projection kernels. 
We crucially use the structure of the projection kernels based on a compactly 
supported wavelet basis. Following ideas from Robins et al. (2015), we 
implement a binning argument to keep track of membership of sampled 
observations in partitions of the sample space created by a particular resolution 
level of the wavelet expansion, and this turns out to be crucial in controlling 
the higher order moments of our estimator at the right level. 

We would like to mention that our proofs for analyzing the adaptive estimator 
of a general nonlinear integral functional of the density use projection kernels 
based on the Haar basis expansion. We provide proofs for a general compactly 
supported wavelet based procedure when analyzing the quadratic functional. 
However, for the analysis of general nonlinear integral functionals of the 
density, the requirement of the Haar basis arises at one specific instance in 
the proof. We expect that similar results continue to hold for more general 
compactly supported wavelets using arguments developed herein. However, 
we do not further consider other wavelet bases here. 

The paper is organized as follows. In Section 1, we discuss Lepski’s method 
and the main idea of the paper. In Section 2, we introduce basic notation, 
terminology and some theory about compactly supported wavelets that we will 
need in the sequel. In Section 3, we study the estimation of the quadratic 
functional and provide a general class of adaptive estimators when the kernels 
of the U-statistics are based on a class of compactly supported wavelet bases. 
Section 4 is devoted to the development of a modified Lepski’s method suitable 
for analyzing higher order moments of the density. We then use these results 
to develop an adaptive estimator of $\int f^3 d\mu$ in Section 5. Section 6 is used 
to understand how the construction of an adaptive estimator of a general 
non-linear functional $\int T(f) d\mu$ for smooth $T$ can be derived from our analyses 
in Section 5. A lower bound on the required price of adaptation is provided 
in Section 7. Section 8 contains some discussions and future work. Finally, 
all proofs are collected in the Appendix. 

1. Lepski’s Method and Heuristics of the Main Idea. 

The purpose of this section is to heuristically explain the main idea behind 
modifying Lepski’s Method suitably in our context. We do this at a high level 
of abstraction and do not provide formal results. Sections 4 and 5 are devoted 
to making the heuristics of this section more precise. We begin with a 
discussion of Lepski’s Method as a recipe for producing adaptive estimators 
from a sequence of candidate estimators. Our discussion is inspired by the 
wonderful article of Birge (2001). As astutely observed by Birge (2001), 
Lepski’s Method can be succinctly described in a relatively abstract setup as 
follows. 

Using notation in essence similar to Birge (2001), consider a family of 
experiment spaces $\{(\chi, \mathcal{B}(\chi), P), P \in S_\theta\}_{\theta \in \Theta}$ for a measurable space $\chi$, sigma 
field $\mathcal{B}(\chi)$ and probability measure $P$ in one of the parameter sets $\{S_\theta\}_{\theta \in \Theta}$. 
With an i.i.d. sample of size $n$ from such an experiment, we consider uniform 
rates of convergence for estimation of certain objects of interest over these 
parameter sets. In particular, suppose $s : P \to \mathcal{Y}$ is an object of interest 
for the experiment of interest, with $\mathcal{Y}$ being a pseudo-metric space equipped 
with pseudo-metric $d$. For a given estimator $\hat{s}$, based on a sample of size 
$n$, of a required object of interest $s$, define its rate of convergence over $S_\theta$ as 
$r(\theta, \hat{s}) = \sup_{P \in S_\theta} E_P\left(d^q(s, \hat{s})\right)$, where $q > 1$. An estimator $\hat{s}$ is often called 
(minimax) rate optimal over $S_\theta$ if $r(\theta, \hat{s}) \asymp r(\theta) := \inf_{\tilde{s}} r(\theta, \tilde{s})$. 

Now, starting from a family of rate optimal estimators $\{\hat{s}_\theta\}_{\theta \in \Theta}$, Lepski’s 
Method provides a strategy for building a new estimator which has “good” 
performance simultaneously over all sets $S_\theta$, $\theta \in \Theta$. Working under the 
regime of estimating a whole function in the context of nonparametric 
regression, Lepskii (1991, 1992, 1993) developed such a strategy. Later these 
methods have been further used by Efromovich and Low (1994); Efromovich et al. 
(1996b); Gine and Nickl (2008); Klemela and Tsybakov (2001) and others 
for studying adaptation theory of linear and quadratic functionals in 
nonparametric problems. The method can be summarized briefly as 
follows. 

Suppose that $\Theta \subset \mathbb{R}$ is a bounded subset such that $S_\theta$ is non-decreasing 
with respect to $\theta$, the risks and minimax rates $r(\theta, \hat{s}_\theta), r(\theta)$ are continuous 
with respect to $\theta$, and for each $\theta \in \Theta$ there is a rate optimal estimator $\hat{s}_\theta$ where, 
for large enough $n$, $d^q(s, \hat{s}_\theta)$ is suitably concentrated around its expectation. 
One then chooses, for each $n$, a suitably fine discretization $\theta_1 < \ldots < \theta_{K(n)}$ 
of $\Theta$ and finally, for some large enough constant $C$, defines the candidate 
estimator to be $\hat{s}_{\theta_{\hat{j}}}$ where 

$$\hat{j} := \inf\left\{ j \leq K(n) : d^q(\hat{s}_{\theta_j}, \hat{s}_{\theta_l}) \leq C r(\theta_l, \hat{s}_{\theta_l}), \ \forall l \in (j, K(n)) \right\}.$$

Let us try to intuitively elaborate on the method described above. At its 
heart, the definition of $\hat{j}$ is devoted to choosing the “best” $\theta$ from the point 
of view of the risk of the estimator of interest. Often, the “best” $\theta$ is the one 
corresponding to the unknown data generating mechanism, and if $\hat{j}$ selects 
a value “close enough” to this $\theta$ with high probability, then owing to the 
continuity property of the risk, one obtains the desired adaptive performance of 
the final estimator $\hat{s}_{\theta_{\hat{j}}}$. The required high probability selection of the desired 
$\theta$ is driven by the concentration of $d^q(s, \hat{s}_\theta)$ around its expectation. 
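
The selection rule described above can be sketched in a few lines. The following is only a minimal illustration under stated assumptions: the objects being estimated are real-valued, the pseudo-metric is the absolute difference, and the names `lepski_select`, `estimates` and `rates` are hypothetical (they do not appear in the paper). The user supplies the candidate estimates on the grid together with bounds on their risks.

```python
def lepski_select(estimates, rates, C=1.0, q=2):
    """Return the smallest index j whose estimate agrees with all later ones.

    estimates[l] stands for the candidate estimator at grid point theta_l and
    rates[l] for a bound on its risk r(theta_l, .). Both are user inputs
    (hypothetical names). Agreement means |est_j - est_l|^q <= C * rates[l]
    for every later index l > j.
    """
    K = len(estimates)
    if K == 0 or len(rates) != K:
        raise ValueError("need equally long, non-empty inputs")
    for j in range(K):
        agrees = all(
            abs(estimates[j] - estimates[l]) ** q <= C * rates[l]
            for l in range(j + 1, K)
        )
        if agrees:
            return j
    return K - 1  # not reached: the last index satisfies the rule vacuously


# Toy inputs: the first two candidates are visibly biased, the rest agree.
estimates = [0.50, 0.80, 0.98, 1.01, 1.00]
rates = [1e-4, 5e-4, 1e-3, 2e-3, 4e-3]
j_hat = lepski_select(estimates, rates)
```

In this toy example the first two candidates disagree with later ones by more than the allowed threshold, so the rule settles on the third candidate.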

Consider now a situation where suitable concentration of $d^q(s, \hat{s}_\theta)$ around 
its expectation is not easily achieved. This is often the case when the 
sequence of estimators $\{\hat{s}_\theta\}_{\theta \in \Theta}$ is not sufficiently “nice” for application of 
standard concentration inequalities. In such cases there can be two possible 
ways out. The first entails obtaining a sufficiently sharp concentration 
inequality for the sequence of estimators $\{\hat{s}_\theta\}_{\theta \in \Theta}$ at hand. This is often an 
important and difficult question in its own right, even outside the context 
of the problem of adaptive estimation at hand. In certain cases, however, 
an alternative strategy might be available. This paper pursues the second 
approach in the context of estimating a non-linear integral functional of a 
density. 

At a high level of abstraction, our method can be described as follows. 
As mentioned earlier, the main challenge in applying Lepski’s Method is 
obtaining a suitably sharp concentration of $d^q(s, \hat{s}_\theta)$ around its expectation. 
However, suppose that there exists a different object of interest 
$\tilde{s} : P \to \tilde{\mathcal{Y}}$, with $\tilde{\mathcal{Y}}$ being a pseudo-metric space 
equipped with pseudo-metric $\tilde{d}$, and a corresponding sequence of estimators 
$\hat{\tilde{s}}_\theta$ with concentration of $\tilde{d}^q(\tilde{s}, \hat{\tilde{s}}_\theta)$ around its expectation and the further 
property that $\tilde{d}^q(\hat{\tilde{s}}_\theta, \hat{\tilde{s}}_{\theta'})$ and $r(\theta, \hat{\tilde{s}})$ equal $d^q(\hat{s}_\theta, \hat{s}_{\theta'})$ and $r(\theta, \hat{s})$ up to 
multiplicative constants, uniformly in $(\theta, \theta') \in \Theta \times \Theta$. Our idea is to define 

$$\hat{j} := \inf\left\{ j \leq K(n) : \tilde{d}^q(\hat{\tilde{s}}_{\theta_j}, \hat{\tilde{s}}_{\theta_l}) \leq C r(\theta_l, \hat{\tilde{s}}_{\theta_l}), \ \forall l \in (j, K(n)) \right\},$$

and to use $\hat{s}_{\theta_{\hat{j}}}$ as our candidate estimator. Under some conditions on the 
actual sequence of estimators $\hat{s}_\theta$ one then obtains the desired adaptation properties 
of the final estimator $\hat{s}_{\theta_{\hat{j}}}$. These conditions on $\hat{s}_\theta$ need to be less demanding 
than sharp concentration of $d^q(s, \hat{s}_\theta)$ around its expectation, and this turns 
out to be the case in our context. 
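
As a hedged sketch of the modification just described: the index is chosen by comparing a well-concentrated surrogate sequence (in this paper, estimators of the quadratic functional), and the target estimator is then read off at the chosen index. The function and argument names below are hypothetical, real-valued objects and the absolute-difference metric are assumed, and the code illustrates the selection logic only, not the paper's estimator.

```python
def select_index(surrogate_estimates, surrogate_rates, C=1.0, q=2):
    """Lepski-type rule applied to the surrogate sequence only."""
    K = len(surrogate_estimates)
    if K == 0 or len(surrogate_rates) != K:
        raise ValueError("need equally long, non-empty inputs")
    # The last index always qualifies (empty condition), so next() succeeds.
    return next(
        j
        for j in range(K)
        if all(
            abs(surrogate_estimates[j] - surrogate_estimates[l]) ** q
            <= C * surrogate_rates[l]
            for l in range(j + 1, K)
        )
    )


def modified_lepski(target_estimates, surrogate_estimates, surrogate_rates,
                    C=1.0, q=2):
    """Pick j_hat from the surrogates, then return the target at j_hat."""
    if len(target_estimates) != len(surrogate_estimates):
        raise ValueError("target and surrogate grids must match")
    j_hat = select_index(surrogate_estimates, surrogate_rates, C=C, q=q)
    return j_hat, target_estimates[j_hat]


# Toy inputs: surrogates drive the choice, the target is never compared.
surrogates = [0.50, 0.80, 0.98, 1.01, 1.00]
rates = [1e-4, 5e-4, 1e-3, 2e-3, 4e-3]
targets = [2.0, 1.5, 1.2, 1.1, 1.15]
j_hat, final_estimate = modified_lepski(targets, surrogates, rates)
```

The point of the design is that no deviation inequality for the target sequence enters the selection step; only its moments matter for the final risk bound.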

To fix ideas, for our case the experiment space will be $(\chi, \mathcal{B}(\chi), P : 
P \in S_\theta)_{\theta \in \Theta} = ([0,1], \mathcal{B}([0,1]), P = f d\mu : f \in H(\beta, C))_{\beta \in (0, \frac{1}{4})}$, i.e. the parameter 
spaces of interest are $H(\beta, C)$, where $\beta$ is identified with $\theta$ in the discussion 
above, and the object of interest is $s = \int T(f) d\mu$ based on i.i.d. 
observations $X_1, \ldots, X_n$ from a density $f \in H(\beta, C)$. As discussed earlier, the case 
of quadratic functionals, i.e. $T(x) = x^2$, has been studied in detail in the 
literature. In particular, one has suitable concentration of $d^2(\int f^2 d\mu, \hat{\tilde{s}}_\beta)$ 
around its expectation, where $d(z_1, z_2) = |z_1 - z_2|$ for $z_1, z_2 \in \mathbb{R}$ and 
$\hat{\tilde{s}}_\beta$ is a rate optimal estimator of the quadratic functional. Such an 
estimator is often a second order U-statistic, for which the desired concentration 
type results can be obtained from Gine, Latala and Zinn (2000); 
Houdre and Reynaud-Bouret (2003). However, for estimating $s = \int f^3 d\mu$, 
which is crucial for understanding the general smooth non-linear functional 
problem, the estimator sequence $\{\hat{s}_\beta\}$ is a third order U-statistic. In 
general, for estimating $s = \int f^r d\mu$, the estimator sequence $\{\hat{s}_\beta\}$ is an $r$th 
order U-statistic. In order to apply the standard Lepski’s Method we will 
need to obtain suitable exponential deviation inequalities for higher order 
U-statistics. Although such moment inequalities exist (Adamczak et al., 
2006), the bounds include complicated quantities that must be controlled 
in a problem specific manner. However, owing to Tchetgen et al. (2008), we 
observe that the biases of the estimator sequence $\{\hat{s}_\beta\}$ are within 
multiplicative constants of those of the estimators for the quadratic functional, i.e. $\{\hat{\tilde{s}}_\beta\}$, 
and their variances remain of the same order as well. Since we are interested 
in estimation in squared error norm, this motivates us to use $\hat{j}$ constructed 
using the estimator for the quadratic functional and to analyze $\hat{s}_{\beta_{\hat{j}}}$ as our final 
estimator. 

In Sections 3 and 4 below, we elaborate on the ideas laid down above. 
In particular, we provide a concrete example of the application of the standard 
Lepski’s Method while estimating the quadratic functional of the density. 
Subsequently, we employ the modification of Lepski’s method as discussed above 
to study adaptive estimation of more general non-linear integral functionals 
of the density. 


2. Notation. It is well known that without imposing restrictions on 
function classes, consistent inference in non-parametric problems is not 
feasible. Our results are stated in terms of certain regularity conditions on 
the class of densities. We will place the following kind of bounds on their 
roughness or complexity. 


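Definition 2.1. A function $h(\cdot)$ with domain $[0,1]^d$ is said to belong to 
a Hölder ball $H(\beta, C)$, with Hölder exponent $\beta > 0$ and radius $C > 0$, if and 
only if $h(\cdot)$ is uniformly bounded by $C$, all partial derivatives of $h(\cdot)$ up to 
order $\lfloor \beta \rfloor$ exist and are bounded, and all partial derivatives $\nabla^{\lfloor \beta \rfloor} h$ of order 
$\lfloor \beta \rfloor$ satisfy 

$$\sup_{x, x + \delta x \in [0,1]^d} \left| \nabla^{\lfloor \beta \rfloor} h(x + \delta x) - \nabla^{\lfloor \beta \rfloor} h(x) \right| \leq C \|\delta x\|^{\beta - \lfloor \beta \rfloor}.$$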


With some abuse of notation we will denote by $H(\beta)$ the class of functions 
in $H(\beta, C)$ such that $\|f\|_\infty \leq M^*$ and $f_{\min} \geq M_*$ (where $f_{\min} = 
\inf_{x \in [0,1]} f(x)$) for known constants $M^* > M_* > 0$. Note that knowledge 
of $C$ indeed gives us a bound on $M^*$. Our results will be based on known 
$C, M^*, M_*$ only to the extent that these quantities determine the multiplicative 
constants of the convergence rates obtained throughout. Adaptation with respect to 
these parameters may be of interest, but is beyond the scope of this paper. 

A crucial ingredient for constructing our estimators is the use of orthogonal 
projection kernels onto increasing finite dimensional subspaces of $L_2[0,1]$. 
The construction of such kernels is based on a suitable choice of an orthonormal 
basis of $L_2[0,1]$. With this in mind we provide a brief discussion of 
orthogonal projection kernels and compactly supported wavelet bases. 

The kernels (possibly dependent on $n$) are assumed to be measurable maps 
$K_n : [0,1] \times [0,1] \to \mathbb{R}$ that are symmetric in their arguments and satisfy 
$\int \int K_n^2(x_1, x_2) dx_1 dx_2 < \infty$ for all $n$. The corresponding kernel operators 
(which we denote by the same symbol with an abuse of notation) 

$$K_n h(x) = \int h(v) K_n(x, v) dv$$

are continuous linear operators $K_n : L_2[0,1] \to L_2[0,1]$. Throughout we will 
work with kernels whose operator norms $\|K_n\| = \sup \{\|K_n f\|_2 : \|f\|_2 \leq 1\}$ 
are uniformly bounded, i.e., $\sup_n \|K_n\| < \infty$. The operator norm $\|K_n\|$ is 
however typically much smaller than the $L_2[0,1] \times L_2[0,1]$ norm of the kernel. 
In our case this will typically be of the order of $k$ given by the dimension of 
the projected space. 

Kernels $K_n$ most commonly used in statistical applications are usually 
projection kernels. A projection kernel operator satisfies $K_n h = h$ for all 
$h$ in its range space; that is, for any function $f$ in the range of the kernel, 
$f(x) = \int f(v) K_n(x, v) dv$ for a.e. $x$. For a given orthonormal basis $e_1, e_2, \ldots$ 
of $L_2[0,1]$, the orthogonal projection onto $\mathrm{lin}(e_1, \ldots, e_k)$ is the map $K_k : 
f \mapsto \sum_{j=1}^{k} \langle f, e_j \rangle e_j$. It is given by the kernel operator $K_k : L_2[0,1] \to L_2[0,1]$ 
with kernel 

$$K_k(x_1, x_2) = \sum_{j=1}^{k} e_j(x_1) e_j(x_2).$$

It is easy to show by orthonormality properties that $\|K_k\| = 1$ and $\int \int K_k^2 = k$. 

We now provide examples of orthogonal projection kernels that will be 
used extensively throughout. 

2.1. Haar Basis. For “father” and “mother” functions $\phi(x) = 1_{(0,1]}(x)$ 
and $\psi(x) = 1_{(0,1/2]}(x) - 1_{(1/2,1]}(x)$, the Haar basis is the set of functions 
$\{\phi\} \cup \{\psi_{i,j} : i = 0, 1, \ldots, \ j = 0, 1, \ldots, 2^i - 1\}$, for $\psi_{i,j}(x) = 2^{i/2} \psi(2^i x - j)$, 
$i = 0, 1, \ldots$, $j = 0, 1, \ldots, 2^i - 1$. The Haar basis is a complete orthonormal 
basis of $L_2[0,1]$. The linear span of the first $k = 2^I$ basis elements $\{\phi, \psi_{i,j} : 
i = 0, \ldots, I - 1, \ j = 0, \ldots, 2^i - 1\}$ is equal to the linear span of the scaled 
and shifted father functions $\{\phi_{I,j} : j = 0, \ldots, 2^I - 1\}$, where 
$\phi_{I,j}(x) = 2^{I/2} \phi(2^I x - j)$, and the corresponding orthogonal projection is 
given by the kernel 

$$K_k(x_1, x_2) = \sum_{j=0}^{2^I - 1} \phi_{I,j}(x_1) \phi_{I,j}(x_2) = k \sum_{j=1}^{k} 1\left\{x_1 \in \left(\tfrac{j-1}{k}, \tfrac{j}{k}\right]\right\} 1\left\{x_2 \in \left(\tfrac{j-1}{k}, \tfrac{j}{k}\right]\right\}.$$
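
To make the Haar projection kernel concrete, here is a small numerical sketch. It assumes the bin form of the kernel, namely $K_k(x_1, x_2) = k$ when both arguments fall in the same interval $((j-1)/k, j/k]$ and $0$ otherwise, and it checks the identity $\int \int K_k^2 = k$ by a midpoint rule. The function names and the grid size `m` are illustrative choices, not part of the paper.

```python
import math


def haar_bin(x, k):
    """Index j in {1, ..., k} with x in ((j-1)/k, j/k]; x = 0 goes to bin 1."""
    return max(1, math.ceil(x * k))


def haar_kernel(x1, x2, k):
    """Haar projection kernel at level k: k if same bin, else 0."""
    return float(k) if haar_bin(x1, k) == haar_bin(x2, k) else 0.0


def integrate_squared_kernel(k, m=64):
    """Midpoint-rule value of the double integral of K_k^2 on [0,1]^2.

    m should be a multiple of k so that no midpoint sits on a bin boundary.
    """
    pts = [(i + 0.5) / m for i in range(m)]
    total = sum(haar_kernel(x, y, k) ** 2 for x in pts for y in pts)
    return total / (m * m)


value = integrate_squared_kernel(8)
```

Because the kernel is piecewise constant on a grid that the midpoints resolve exactly, the midpoint rule carries no discretization error here.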




2.2. Wavelets. Consider expansions of functions $h \in L_2(\mathbb{R}^d)$ on an 
orthonormal basis of compactly supported bounded wavelets of the form 

$$h(x) = \sum_{j \in \mathbb{Z}^d} \sum_{v \in \{0,1\}^d} \langle h, \psi^v_{0,j} \rangle \psi^v_{0,j}(x) + \sum_{i=0}^{\infty} \sum_{j \in \mathbb{Z}^d} \sum_{v \in \{0,1\}^d - \{0\}} \langle h, \psi^v_{i,j} \rangle \psi^v_{i,j}(x), \qquad (2.1)$$

where the base functions $\psi^v_{i,j}$ are orthogonal for different indices $(i, j, v)$ 
and are scaled and translated versions of the $2^d$ base functions $\psi^v_{0,0}$, i.e., 
$\psi^v_{i,j}(x) = 2^{id/2} \psi^v_{0,0}(2^i x - j)$. Such a wavelet basis can be obtained as tensor 
products $\psi^v_{0,0} = \phi^{v_1} \times \ldots \times \phi^{v_d}$ of a given father wavelet $\phi^0$ and mother 
wavelet $\phi^1$ in one dimension. 

We are interested in functions $f$ with support $[0,1]$. In view of the compact 
support of the wavelets, for each resolution level $i$ and index $v$, only $2^i$ base 
elements $\psi^v_{i,j}$ are non-zero on $[0,1]$; let us denote the corresponding set of 
indices $j$ by $J_i$. Truncating the expansion at resolution level $i = I$ then gives 
an orthogonal projection on a subspace of dimension $k$ of order $2^I$. Let 

$$K_k(x_1, x_2) = \sum_{j \in J_0} \sum_{v \in \{0,1\}} \psi^v_{0,j}(x_1) \psi^v_{0,j}(x_2) + \sum_{i=0}^{I} \sum_{j \in J_i} \psi^1_{i,j}(x_1) \psi^1_{i,j}(x_2).$$

It is worth noting that we can re-express the wavelet expansion (2.1) to start 
from a level $I$ as 

$$h(x) = \sum_{j \in \mathbb{Z}} \sum_{v \in \{0,1\}} \langle h, \psi^v_{I,j} \rangle \psi^v_{I,j}(x) + \sum_{i=I+1}^{\infty} \sum_{j \in \mathbb{Z}} \langle h, \psi^1_{i,j} \rangle \psi^1_{i,j}(x).$$

The projection kernel $K_k$ sets the coefficients in the second sum equal to 
zero and hence can also be expressed as 

$$K_k(x_1, x_2) = \sum_{j \in J_I} \sum_{v \in \{0,1\}} \psi^v_{I,j}(x_1) \psi^v_{I,j}(x_2).$$

Owing to the discussion above, with some abuse of notation, we will work 
with the definition of the projection kernel as 

$$K_k(x_1, x_2) = \sum_{j=1}^{k} \psi_{k,j}(x_1) \psi_{k,j}(x_2),$$

which will correspond to the orthogonal projection operator onto the first 
$O(\log_2 k)$ levels of the wavelet basis. Owing to the compactness of the support of the wavelet 
bases, only a fixed $O(k)$ many $\psi_{k,j}$’s will intersect $[0,1]$. We have then 
conveniently renamed them as $\psi_{k,j}$, $j = 1, \ldots, k$. This simplifies notation without loss 
of generality. In particular, this is exact for the Haar kernel since, as previously 
noted, we then have that 

$$K_k(x, y) = \sum_{j=1}^{k} \psi_{k,j}(x) \psi_{k,j}(y),$$

where $\psi_{k,j}(x) = \sqrt{k} \, 1\left\{x \in \left(\tfrac{j-1}{k}, \tfrac{j}{k}\right]\right\}$, $j = 1, \ldots, k$, is the $\log_2 k$ level Haar 
basis of $L_2[0,1]$. Also, for larger expressions involved in some proofs, we 
often write $\psi_{k,j} = \psi^k_j$ for convenience. 

With these conventions, let us now discuss some approximation properties 
of these projection kernels. Letting $\bar{\psi}_k := \mathrm{lin}\{\psi_{k,j} : j = 1, \ldots, k\}$, define 

$$f_k(x) := \Pi_{\mathrm{Lebesgue}}(f \mid \bar{\psi}_k) := \int f(y) K_k(x, y) dy.$$

For convenience of notation, define 



It is well known (Hardle et al., 1998) that, choosing the $\psi_{k,j}$’s to be the $\log_2 k$ 
level compactly supported wavelet basis with suitable vanishing moment 
conditions on the mother wavelet, one has that 

$$\sup_{h \in H(\beta, C)} \|h - h_k\|_2 \lesssim k^{-\beta}.$$

The results in this paper are mostly asymptotic in nature and thus 
require some standard asymptotic notation. If $a_n$ and $b_n$ are two sequences 
of real numbers then $a_n \gg b_n$ (and $a_n \ll b_n$) implies that $a_n / b_n \to \infty$ 
(and $a_n / b_n \to 0$) as $n \to \infty$, respectively. Similarly $a_n \gtrsim b_n$ (and $a_n \lesssim b_n$) 
implies that $\liminf a_n / b_n = C$ for some $C \in (0, \infty]$ (and $\limsup a_n / b_n = C$ 
for some $C \in [0, \infty)$). Alternatively, $a_n = o(b_n)$ will also imply $a_n \ll b_n$, 
and $a_n = O(b_n)$ will imply that $\limsup a_n / b_n = C$ for some $C \in [0, \infty)$. 
Finally, we comment briefly on the various constants appearing throughout the 
text and proofs. Given that our primary results concern convergence rates 
of various estimators, we will not emphasize the role of constants 
and rely on fairly generic notation for such constants. Some 
conventions we follow whenever required are as follows. Throughout the paper 
we denote by $C(\psi_0, \|f\|_\infty)$ a non-negative constant that depends on a fixed 
$\psi_0$ (a function that can be taken as a majorant of both father and mother 
wavelets in absolute value) and $\|f\|_\infty$. Often, $C_{\psi_0}$ will denote a number 
which depends only on $\psi_0$. Hence if $\|f\|_\infty$ is known to us, such constants are 
deterministic and eventually can all be replaced by a universal constant 
by taking suitable care of all possible constants appearing in this paper. 
Finally, we will also use $C(\psi_0, \|f\|_\infty, f_{\min})$ to denote numbers which depend 
on $\psi_0$, $\|f\|_\infty$, $f_{\min}$ only. 

3. Lepski’s Method and Minimax Adaptive Estimation of $\phi(f) = \int f^2 d\mu$. 


By way of introduction, in this section we provide a concrete example 
where Lepski’s Method applies in its standard form. For the sake of 
simplicity, we restrict ourselves to adaptation over two parameter 
spaces indexed by two smoothness classes $\beta_1 < \beta_0 < \frac{1}{4}$. The more general 
case can be addressed from our results in Section 4. However, before going 
into further details, we need additional notation and a basic result which 
drives the main idea behind Lepski’s Method. For our candidate sequence 
of estimators, define 

$$U_n^{(k)} = \frac{1}{n(n-1)} \sum_{i_1 \neq i_2} K_k(X_{i_1}, X_{i_2}), \qquad k(\beta) = \left\lceil n^{\frac{2}{1+4\beta}} \right\rceil,$$

$$U_{n,j} := U_n^{(k_j)}, \qquad k_j = k(\beta_j) := \left\lceil n^{\frac{2}{1+4\beta_j}} \right\rceil,$$

$$U^*_{n,j} := U_n^{(k^*_j)}, \qquad k^*_j = k^*(\beta_j) = \left\lceil \left(\frac{n^2}{\log n}\right)^{\frac{1}{1+4\beta_j}} \right\rceil.$$

It is easy to show that, knowing the exact smoothness $\beta_j$, one can readily 
use $U_{n,j}$ as a minimax rate optimal estimator of $\phi(f)$. Asymptotic normality 
of the estimators, as stated by Lemma C.3, drives Lepski’s Method as 
described in Section 1. Let 

$$U_{n,0} := U_n^{(k_0)}, \qquad k_0 = \left\lceil n^{\frac{2}{1+4\beta_0}} \right\rceil,$$

$$U_{n,1} := U_n^{(k_1)}, \qquad k_1 = \left\lceil n^{\frac{2}{1+4\beta_1}} \right\rceil,$$

$$U_{n,*} := U_n^{(k_*)}, \qquad k_* = \left\lceil \left(\frac{n^2}{\log n}\right)^{\frac{1}{1+4\beta_1}} \right\rceil.$$

The method we now describe is along the lines of Efromovich et al. 
(1996b), suitably adapted to deal with our situation. Our goal is to choose a 
data dependent $\hat{k} \in \{k_0, k_1\}$ so that if we base our estimation on $U_n^{(\hat{k})}$ the 
mean squared error of estimation will be suitably controlled. In particular, 

$$\sup_f E_f \left| U_{n,\hat{j}} - \phi(f) \right|^2 \leq \sum_{k=0}^{1} \sup_f E_f \left[ 1\{\hat{j} = k\} \left| U_{n,\hat{j}} - \phi(f) \right|^2 \right] = \sup_f R_0(\beta_f) + \sup_f R_1(\beta_f).$$

We divide our study into two cases having two sub-cases each. 

First, consider the case $\beta = \beta_0$. Then $R_0(\beta_0)$ achieves the minimax rate 
since the truth is $\beta_0$ and we are choosing $\beta_0$. Next we need to bound $R_1(\beta_0)$ 
by $n^{-\frac{8\beta_0}{1+4\beta_0}}$ up to a potentially logarithmic factor. It is worth noting that, 
along the lines of Efromovich et al. (1996b), since $\beta_0$ corresponds to the higher 
smoothness regime, we do not need to pay the logarithmic price. The term 
$R_1(\beta_0)$ corresponds to the scenario where the truth is $\beta_0$ but the procedure 
wrongly chooses $\beta_1$ as the underlying smoothness. If this happens too often, 
then the mean squared error will be sub-optimal. Therefore, our testing 
procedure should guarantee that this type of error, i.e., choosing a lower smoothness 
when the truth is a higher smoothness, does not happen too often. Since 
the probability of making such an error must reduce the mean squared error 
$n^{-\frac{8\beta_1}{1+4\beta_1}}$ down to $n^{-\frac{8\beta_0}{1+4\beta_0}}$ for any $\beta_1$, it must be of $O(\frac{1}{n})$. So the problem 
boils down to designing a selection procedure so that the probability of 
selecting the lower smoothness is $O(\frac{1}{n})$. If one can find a sequence of statistics $T_n$ 
such that $T_n$ converges weakly to $N(0,1)$ “fast enough” under $\beta_0$ and $|T_n|$ 
diverges polynomially in $n$ under $\beta_1$, then a simple test for selecting the 
lower smoothness with the required error rate is given by $1(|T_n| > \sqrt{2 \log n})$. To 
see this, note that the probability of selecting the lower smoothness under 
$\beta_0$ is approximately $\Pr(|N(0,1)| > \sqrt{2 \log n}) = O(\frac{1}{n})$. Provided the 
approximation in the CLT is also of the order of $\frac{1}{n}$, we have achieved our goal for 
this first case. 
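
The Gaussian tail claim behind the threshold $\sqrt{2 \log n}$ can be checked numerically. The sketch below uses the identity $\Pr(|N(0,1)| > t) = \mathrm{erfc}(t / \sqrt{2})$ and the standard Mills-ratio bound, which at this threshold gives at most $1 / (n \sqrt{\pi \log n})$. The function names are illustrative only, and the check concerns the exact normal law, not the CLT approximation error discussed in the text.

```python
import math


def two_sided_normal_tail(t):
    """P(|Z| > t) for a standard normal Z, via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))


def selection_error_probability(n):
    """P(|Z| > sqrt(2 log n)), the nominal error of the threshold test."""
    return two_sided_normal_tail(math.sqrt(2.0 * math.log(n)))


def mills_bound(n):
    """Upper bound 2 * pdf(t) / t at t = sqrt(2 log n), i.e. 1 / (n sqrt(pi log n))."""
    return 1.0 / (n * math.sqrt(math.pi * math.log(n)))


# n times the error probability stays below 1 and slowly decreases.
scaled = [n * selection_error_probability(n) for n in (10, 100, 10**4, 10**6)]
```

So the nominal error is of order $\frac{1}{n}$ with an extra $\sqrt{\log n}$ in the denominator, consistent with the $O(\frac{1}{n})$ requirement.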

Considering the complementary case of $\beta = \beta_1$, we will need the error 
probability of selecting the wrong smoothness, $\Pr(|T_n| \leq \sqrt{2 \log n})$, to 
converge to 0 “polynomially fast”. As we establish below, this can be achieved 
with the right choice of test statistic. Finally, by Lemma C.3, a candidate 
$T_n$ which satisfies the above mentioned properties is given by 

$$T_n = \frac{U_{n,*} - U_{n,0}}{\sqrt{\mathrm{Var}_f(U_{n,*})}}.$$

We now provide the exact technical details that justify the argument laid 
down so far. 

We start with some further notation. Define 

$$I(k_1, k_2) := \phi(f_{k_2}) - \phi(f_{k_1}), \qquad k_1 < k_2,$$

corresponding to the difference in truncation bias at $k_1$ versus $k_2$. With this 
notation, it is easy to see that $I(k_0, k_*) = O(k_0^{-2\beta_f})$ when the truth is $\beta_f$. A 
suitable estimator of $I(k_0, k_*)$ is indeed 

$$\hat{I}(k_0, k_*) = U_{n,*} - U_{n,0}.$$

We then have the following theorem. 


Theorem 3.1. For a deterministic constant C depending on the parent 
wavelets, we have the following result. Let 

j =1 (j(k 0 ,h) > C^/log 


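and let $\hat{\phi} = U_{n,\hat{j}}$. Then 

$$\sup_{f \in H(\beta_0)} E_f \left| \hat{\phi} - \phi(f) \right|^2 \lesssim n^{-\frac{8\beta_0}{1+4\beta_0}},$$

$$\sup_{f \in H(\beta_1)} E_f \left| \hat{\phi} - \phi(f) \right|^2 \lesssim (\log n)^{\frac{4\beta_1}{1+4\beta_1}} n^{-\frac{8\beta_1}{1+4\beta_1}}.$$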


The proof can be found in the appendix. However, it is worth mentioning 
that the two crucial ingredients of the proof are control of the error of testing 
at a suitably vanishing level and bounds on higher order central moments of 
the non-adaptive minimax estimators dependent on a non-random truncation 
level $k$. These properties are both derived from exponential tail inequalities 
for second order U-statistics using results of Gine, Latala and Zinn (2000); 
Houdre and Reynaud-Bouret (2003). In particular, this is a typical structure 
for most applications of Lepski’s method, i.e. the treatment crucially relies 
on some exponential inequalities for the deviations of the sequence of 
non-adaptive minimax estimators (Birge, 2001). For estimation of general 
non-linear functionals this poses a challenge. The next section is devoted to 
understanding a possible way out. 

4. A General Result by a Modification of Lepski’s Method. 

The results of the previous section hold only for the quadratic 
functional $\phi(f) = \int f^2 d\mu$. This is because we have crucially used a 
deviation inequality which only holds for degenerate second order U-statistics. 
The minimax estimator for $\phi(f) = \int f^r d\mu$ is in general an $r$th order 
U-statistic. In order to apply the standard Lepski’s Method we will need to 
obtain suitable exponential deviation inequalities for higher order U-statistics. 
Although such moment inequalities do exist (Adamczak et al., 2006), the 
bounds include complicated quantities which need to be controlled in a 
problem specific manner. However, owing to Tchetgen et al. (2008), we 
observe that the biases of the candidate estimators are within multiplicative 
constants of those of the estimators for the quadratic functional, and the variances 
remain of the same order as well. Since we are interested in estimation in squared 
error norm, this motivates us to use the idea discussed in Section 1 to 
construct our final estimator. We now explain this below. 

For a given choice of $d > 1$, let $N$ be the largest integer such that $d^{N-1} \le n^{1 - \frac{1}{\log\log n}}$.
Set $k_{j-1} = d^{j-1} n$ for $j = 1, \ldots, N$. The constraint on $N$ holds if and only if
$(N-1) \le \frac{\log n}{\log d}\left(1 - \frac{1}{\log\log n}\right)$, so that in particular $N = O(\log n)$. Now, for $j = 0, \ldots, N-1$,
define $\beta_j$ to be the solution of $k_j = n^{\frac{2}{1+4\beta_j}}$. This implies that
$1 + 4\beta_j = \frac{2\log n}{j \log d + \log n}$, $j = 0, \ldots, N-1$. Also, note that by construction, $k_0 <
k_1 < \cdots < k_{N-1}$ and therefore $\beta_0 > \beta_1 > \cdots > \beta_{N-1}$. Also, a simple
calculation shows that, (3n~i > 4 iogiogn-i • Therefore, | > fio > /3\ > ... > 
/3jv-i > TITToUru - Further, note that for all 0 < li < I 2 < A — 1, there 


exists constants ci, C 2 such that /3q 


A 2 € 


ci 


h-h c h—h 

logn ’ ^ logn 


. Further, define 


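$$k_j^* = \frac{k_j}{(\log n)^{\frac{1}{1+4\beta_j}}} = \left(\frac{n^2}{\log n}\right)^{\frac{1}{1+4\beta_j}} \quad \text{and} \quad R(k_j^*) = \frac{k_j^*}{n^2}.$$

This definition implies that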


for li > I 2 , pr = ( 1 + 4 /3/i )C 1 +4/3/2) oq. However, note that jR- might 

not enjoy similar properties for certain ranges of l± > I 2 ■ Especially, for j3 
values within rate of each other, the corresponding k values do not 
enjoy the above mentioned property. Let s* = s*(n ) be the smallest integer 
such that, k(/3 s *)* > n. Finally, define 


$$\hat j := \min\left\{ j : \hat I^2(k_j^*, k_l^*) \le C_{opt}^2 \log n\, R(k_l^*) \ \ \forall\, l \ge j,\ \ s^* \le j \le N-1 \right\},$$

where $C_{opt}$ is a deterministic constant to be specified later and, similar to
Section 3, $\hat I(k_j, k_l) = U_{k_l} - U_{k_j}$ for $k_j \le k_l$.
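
The grid construction and the selection rule above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the cap exponent on $d^{N-1}$ is an assumption (it is exposed as the parameter `cap_exponent`), `estimates[l]` stands for the candidate estimator evaluated at $k_l^*$, and all function and parameter names are hypothetical.

```python
import math

def lepski_grid(n, d=2.0, cap_exponent=None):
    """Geometric grid k_j = d^j * n with the matching beta_j, k_j^* and R(k_j^*).
    cap_exponent bounds d^(N-1) <= n^cap_exponent; its default is an assumption."""
    if cap_exponent is None:
        cap_exponent = 1.0 - 1.0 / math.log(math.log(n))
    # Largest N with (N - 1) * log d <= cap_exponent * log n.
    N = int(math.floor(cap_exponent * math.log(n) / math.log(d))) + 1
    grid = []
    for j in range(N):
        one_plus_4beta = 2 * math.log(n) / (j * math.log(d) + math.log(n))
        k_star = (n ** 2 / math.log(n)) ** (1 / one_plus_4beta)
        grid.append({"beta": (one_plus_4beta - 1) / 4,
                     "k_star": k_star,
                     "R": k_star / n ** 2})
    return grid

def lepski_select(estimates, grid, n, c_opt=1.0, s_star=0):
    """Smallest j >= s_star such that (U_l - U_j)^2 <= c_opt^2 * log(n) * R(k_l^*)
    for every l >= j. The last index always qualifies, so a value is returned."""
    N = len(grid)
    for j in range(s_star, N):
        if all((estimates[l] - estimates[j]) ** 2
               <= c_opt ** 2 * math.log(n) * grid[l]["R"]
               for l in range(j, N)):
            return j
    return N - 1
```

Only squared differences of the candidate estimators enter the rule, so no exponential tail bound for the individual estimators is needed to evaluate it.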
Theorem 4.1. Suppose an estimator $U_{n,k}$ of $\phi(f)$ satisfies the following
properties.

1. For sufficiently large $k, n$ and fixed choice of $q > 1$,

$$\sup_f E_f\left(U_{n,k} - E_f(U_{n,k})\right)^{2q} \le C_q \left(\frac{k}{n^2}\right)^q.$$

2. There exists a constant $C$ such that for any $f \in H(\beta, C)$,

$$|I_c(k_1, k_2)| \le C\, I(k_1, k_2),$$

for sufficiently large $k_1 \le k_2$, where $I_c(k_1, k_2) = E_f(U_{n,k_2}) - E_f(U_{n,k_1})$
and by our previous convention $I(k_1, k_2) = \int f_{k_2}^2 - \int f_{k_1}^2$.

Then for any $\beta \in (0, \frac{1}{4})$ and $\epsilon > 0$, we have that

$$\sup_{f \in H(\beta, C)} E_f\left(U_{n,k^*_{\hat j}} - \phi(f)\right)^2 \lesssim n^{-\frac{8\beta}{1+4\beta}} (\log n)^{\frac{4(\beta+\epsilon)}{1+4(\beta+\epsilon)}}.$$
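
As a small worked check of the rate in Theorem 4.1, the sketch below compares it with the non-adaptive rate $n^{-8\beta/(1+4\beta)}$ and verifies numerically that $\log n \cdot R(k^*)$ with $k^* = (n^2/\log n)^{1/(1+4\beta)}$ reproduces the adaptive rate at $\epsilon = 0$. The function names are hypothetical and the code only evaluates the formulas stated in the text.

```python
import math

def minimax_rate(n, beta):
    """Non-adaptive squared-error rate n^(-8 beta / (1 + 4 beta))."""
    return n ** (-8 * beta / (1 + 4 * beta))

def adaptive_rate(n, beta, eps):
    """Rate of Theorem 4.1: the minimax rate times a logarithmic penalty."""
    b = beta + eps
    return minimax_rate(n, beta) * math.log(n) ** (4 * b / (1 + 4 * b))

def log_times_variance(n, beta):
    """log(n) * R(k*) with k* = (n^2 / log n)^(1 / (1 + 4 beta)), R(k) = k / n^2."""
    k_star = (n ** 2 / math.log(n)) ** (1 / (1 + 4 * beta))
    return math.log(n) * k_star / n ** 2
```

For example, `adaptive_rate(10**6, 0.1, 0.0)` and `log_times_variance(10**6, 0.1)` agree up to floating point error, which is the algebraic identity behind the threshold $C_{opt}^2 \log n\, R(k_l^*)$.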


Some remarks are in order about the implications of the theorem above.
It says that if a sequence of estimators, having common index with the
minimax estimator of the quadratic functional, has bias and higher order
moments of the same order of magnitude as the quadratic functional estimator,
then following the discussion in Section 1, the new sequence of
estimators also shares the same rate of convergence (up to an almost matching
logarithmic factor) towards its corresponding functional, simultaneously
over the function spaces which dictate the adaptation of the quadratic
functional. The $\beta + \epsilon$ appearing in the logarithmic factor is due to using a
different test statistic than JJ n as dictated by Lepski’s Method. Establish¬ 
ing an equivalence in the order of the bias for these functionals requires 
fairly standard arguments (Tchetgen et al., 2008). The main challenge lies 
in obtaining suitable bounds on higher order moments of these estimators. 
This requires control of moments of higher order U-statistics based on or¬ 
thogonal projection kernels. We crucially use the structure of the projection 
kernels based on compactly supported wavelet basis. Following ideas from 
Robins et al. (2015), we implement a binning argument to keep track of 
membership of sampled observations in partitions of the sample space cre¬ 
ated by a particular resolution level of the wavelet expansion, and this turns 
out to be crucial in controlling the higher order moments of our estimator 
at the right level. 
5. Minimax Adaptive Estimation of $\int f^3 d\mu$.

As mentioned earlier, based on the technique of Birge and Massart (1995)
one can show that the minimax estimator of a general non-linear functional
$\phi(f) = \int T(f) d\mu$ for smooth $T$ can be constructed by using ideas from linear,
quadratic and cubic functionals of the density and appealing to a standard
Taylor expansion argument. A similar strategy can be followed to produce
adaptive estimators of non-linear functionals, and it becomes
crucial to understand the adaptive estimation of $\int f^3 d\mu$. In particular, a
non-adaptive minimax rate optimal estimator of $\phi(f) = \int f^3 d\mu$ is given by


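$$\begin{aligned}
\hat\phi_{k_1,k_2,k_3} &= V_n\left(K_{k_1,k_2,k_3}(X_{i_1}, X_{i_2}, X_{i_3})\right) \\
&= V_n \int K_{k_1}(x, X_{i_1})\, K_{k_1}(x, X_{i_2})\, K_{k_1}(x, X_{i_3})\, dx \\
&\quad + 3 V_n \int K_{k_1}(x, X_{i_1}) \left(K_{k_3}(x, X_{i_2}) - K_{k_1}(x, X_{i_2})\right) K_{k_1}(x, X_{i_3})\, dx \\
&\quad + 3 V_n \int K_{k_1}(x, X_{i_1}) \left(K_{k_3}(x, X_{i_2}) - K_{k_1}(x, X_{i_2})\right) \left(K_{k_3}(x, X_{i_3}) - K_{k_1}(x, X_{i_3})\right) dx \\
&\quad + 3 V_n \int \left(K_{k_2}(x, X_{i_1}) - K_{k_1}(x, X_{i_1})\right) \left(K_{k_2}(x, X_{i_2}) - K_{k_1}(x, X_{i_2})\right) \left(K_{k_3}(x, X_{i_3}) - K_{k_2}(x, X_{i_3})\right) dx \\
&\quad + V_n \int \left(K_{k_2}(x, X_{i_1}) - K_{k_1}(x, X_{i_1})\right) \left(K_{k_2}(x, X_{i_2}) - K_{k_1}(x, X_{i_2})\right) \left(K_{k_2}(x, X_{i_3}) - K_{k_1}(x, X_{i_3})\right) dx,
\end{aligned}$$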

where = k\ (/9) ~ n, n^ 3//2 2 ^)/( 1+4 ^) < = ^2 (/?) < n (3/2+2/3)/(i+4/3) 

and $k_3 = k_3(\beta) \sim n^{2/(1+4\beta)}$, and $V_n(h(X_{i_1}, X_{i_2}, X_{i_3}))$ corresponds to the
U-statistic based on $h$, i.e. $V_n(h(X_{i_1}, X_{i_2}, X_{i_3})) = \frac{1}{n(n-1)(n-2)} \sum_{i_1 \neq i_2 \neq i_3} h(X_{i_1}, X_{i_2}, X_{i_3})$;

see Tchetgen et al. (2008) for more details. By Tchetgen et al. (2008), the 
bias of this estimator is given by 


$$\int (f_{k_3} - f_{k_2})^3 + 3 \int (f_{k_3} - f_{k_2})^2 (f_{k_2} - f_{k_1}) + \int (f^3 - f_{k_3}^3), \tag{5.1}$$


which is dominated by 


/ (/ 3 - /!,) ~ ^ < 


,^3 


<<?-*. 

n z 


Now note that 
sup

f&H(0,C) 


I E /(01.(2) 7,(2) 7,(2) 
’^3 


E f (0 


k W 7.(1) 7.(1) I 

A;^ . Ay 2 « Ay 3 



(5.2) 


whenever $k^{(1)} = (k_1^{(1)}, k_2^{(1)}, k_3^{(1)}) \le k^{(2)} = (k_1^{(2)}, k_2^{(2)}, k_3^{(2)})$, where we say
$v_1 \le v_2$ for two vectors $v_1, v_2 \in \mathbb{R}^3$ whenever $\max v_1 \le \max v_2$. As shown
by Tchetgen et al. (2008), the variance is also dominated by $k_3/n^2$. Therefore,
similar to the discussion in Section 2, the bias and variance of the candidate
estimator are similar to those of the estimator of the quadratic functional of
the density. We will use this fact to produce an adaptive estimator. The next
result is the main result of this section and helps us apply Theorem 4.1 to
construct an adaptive estimator of $\phi(f) = \int f^3 d\mu$.
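
The bias decomposition (5.1) rests on a pointwise algebraic identity between the mean of the five-term cubic estimator and $f^3$. The sketch below checks that identity numerically at random points; it assumes (as an interpretation of the construction) that the mean of each U-statistic term is obtained by replacing $K_k(x, X_i)$ with the projection $f_k(x)$, and the function names are hypothetical.

```python
import random

def estimator_mean_integrand(f1, f2, f3):
    """Pointwise integrand of the estimator's mean, with f1, f2, f3 standing
    for the projections f_{k_1}(x), f_{k_2}(x), f_{k_3}(x) at a fixed x."""
    return (f1 ** 3
            + 3 * f1 ** 2 * (f3 - f1)
            + 3 * f1 * (f3 - f1) ** 2
            + 3 * (f2 - f1) ** 2 * (f3 - f2)
            + (f2 - f1) ** 3)

def bias_integrand(f, f1, f2, f3):
    """Pointwise integrand of the three-term expression (5.1)."""
    return ((f3 - f2) ** 3
            + 3 * (f3 - f2) ** 2 * (f2 - f1)
            + (f ** 3 - f3 ** 3))

def max_identity_gap(trials=1000, seed=0):
    """Largest violation of f^3 - mean = (5.1) over random inputs."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        f, f1, f2, f3 = (rng.uniform(0.0, 2.0) for _ in range(4))
        lhs = f ** 3 - estimator_mean_integrand(f1, f2, f3)
        gap = max(gap, abs(lhs - bias_integrand(f, f1, f2, f3)))
    return gap
```

With this sign convention the expression in (5.1) equals $\phi(f)$ minus the mean of the estimator; the identity itself does not depend on any smoothness assumption.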


Theorem 5.1. For any $k_1 \le k_2 \le k_3$ as above, one has for all sufficiently
large $n$,

$$E_f\left(\hat\phi_{k_1,k_2,k_3} - E_f(\hat\phi_{k_1,k_2,k_3})\right)^{2q} \le C(\|\psi_0\|_\infty, \|f\|_\infty, f_{\min}) \left(\frac{k_3}{n^2}\right)^q,$$

whenever the projection kernels in the construction are based on the Haar
basis.


Now combining (5.1), (5.2), Theorem 5.1 with Theorem 4.1, we can con¬ 
struct an adaptive estimator of cubic integral functional as follows. 


We use the ideas laid down heuristically in Section 2 and detailed in 
Section 4. In particular, fix a d > 1. Let N be the largest integer such 
that d N ~ l < n 1 ” 10 ® 10 ®". Set lj-\ = d J_1 n for j = 1, ...,1V. Now, for j = 

$0, \ldots, N-1$, define $\beta_j$ to be the solution of $l_j = n^{\frac{2}{1+4\beta_j}}$. Further, define
$l_j^* = \frac{l_j}{(\log n)^{\frac{1}{1+4\beta_j}}} = \left(\frac{n^2}{\log n}\right)^{\frac{1}{1+4\beta_j}}$ and $R(l_j^*) = \frac{l_j^*}{n^2}$. Let $s^* = s^*(n)$ be the smallest
integer such that $l(\beta_{s^*})^* \ge n$. Finally, define

$$\hat j := \min\left\{ j : \hat I^2(l_j^*, l_m^*) \le C_{opt}^2 \log n\, R(l_m^*) \ \ \forall\, m \ge j,\ \ s^* \le j \le N-1 \right\},$$

where $C_{opt}$ will be a deterministic constant and, similar to Section 3, $\hat I(l_j, l_m) =
U_{l_m} - U_{l_j}$ for $l_j \le l_m$. Now let

1,1 ,/c2. k 3 J 

where k\ ~ n, k 2 ~ n 1+4,3 i and k% ~ l*. 
Theorem 5.2. Consider projection kernels based on Haar wavelets. The
estimator $\hat\phi$ of $\phi(f) = \int f^3$ satisfies, for any $\beta \in (0, \frac{1}{4})$ and $\epsilon > 0$,

$$\sup_{f \in H(\beta, C)} E_f\left(\hat\phi - \phi(f)\right)^2 \lesssim n^{-\frac{8\beta}{1+4\beta}} (\log n)^{\frac{4(\beta+\epsilon)}{1+4(\beta+\epsilon)}}$$

for some deterministic $C_{opt}$ (depending only on parent wavelets and $\|f\|_\infty$).

Remark 5.3. For the proof of Theorem 5.1, the requirement of Haar 
wavelet arises at one specific instance in the proof. We expect that similar 
results continue to hold for more general compactly supported wavelets us¬ 
ing arguments developed herein. However we do not further consider other 
wavelet bases. 

6. Minimax Adaptive Estimation of $\int T(f) d\mu$.

We now discuss adaptive estimation of a more general integral functional
of the density. As mentioned earlier, such functionals arise naturally in
information theory. In particular, a large body of work has focused on estimating
the Shannon entropy (Beirlant et al., 1997), and more recent works
include estimation of Renyi and Tsallis entropies (Leonenko and Seleznjev,
2010; Pal, Poczos and Szepesvari, 2010). We will use the ideas
developed so far for adaptive estimation of up to cubic functionals to come
up with an adaptive estimator of smooth non-linear functionals. We only
sketch the idea here and omit technical details. In particular, suppose that
$T(f)$ admits an expansion around $f_0$ such that

$$T(f)(x) = T(f_0)(x) + T^{(1)}(f_0)(x)\,(f(x) - f_0(x)) + T^{(2)}(f_0)\,(f(x) - f_0(x))^2 + T^{(3)}(f_0)\,(f(x) - f_0(x))^3 + O\left((f(x) - f_0(x))^4\right)$$

as / (x) —>• fo ( x ). The idea now is to sample split and obtain an adaptive 
estimator / (x) of / (x) so that (x) — / (x)^ = O p (n _2/3 /( 1+2/3 )). Next, 

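$$\phi(f) = \int T(\hat f)(x) + \int T^{(1)}(\hat f)(x)\,(f(x) - \hat f(x)) + \int T^{(2)}(\hat f)\,(f(x) - \hat f(x))^2 + \int T^{(3)}(\hat f)\,(f(x) - \hat f(x))^3 + O\left(\int (f(x) - \hat f(x))^4\right).$$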

Note that $\int (\hat f(x) - f(x))^4 = O_p(n^{-8\beta/(1+2\beta)}) = o_p(n^{-8\beta/(1+4\beta)})$. Therefore,
we need only learn how to adapt to functionals of the form $\int g_1(\hat f) f(x) +
\int g_2(\hat f) f(x)^2 + \int g_3(\hat f) f(x)^3$. The linear functional estimation theory is
well understood and the quadratic or cubic terms are rather straightforward
generalizations of our statistics. For the quadratic term for instance, use




$$\int g_2(\hat f)(x)\, K_k(x, X_{i_1})\, K_k(x, X_{i_2})\, dx$$


instead of the one used earlier. 
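
To illustrate why a third-order expansion around a pilot estimate suffices, the sketch below evaluates the expansion with the true density plugged in and compares it with $\int T(f)$. This is only a check of the size of the remainder, not an estimator (it uses the unknown $f$), and it writes the Taylor factorials $1/j!$ explicitly, whereas the displayed expansion absorbs such constants into $T^{(j)}$. The function name and the example densities are hypothetical.

```python
import math

def third_order_plug_in(T_derivs, f, f_hat, grid):
    """Midpoint-rule value of sum_{j=0}^{3} int T^(j)(f_hat) (f - f_hat)^j / j!.
    T_derivs[j] is the j-th derivative of T (T_derivs[0] is T itself)."""
    total = 0.0
    for x in grid:
        base = f_hat(x)
        h = f(x) - base
        total += sum(T_derivs[j](base) * h ** j / math.factorial(j)
                     for j in range(4))
    return total / len(grid)

def remainder_example(delta=0.1, m=1000):
    """Gap between int T(f) and the third-order expansion for T = exp."""
    f = lambda x: 1.0 + 0.5 * math.sin(2 * math.pi * x)
    f_hat = lambda x: f(x) + delta * math.cos(2 * math.pi * x)
    grid = [(j + 0.5) / m for j in range(m)]
    truth = sum(math.exp(f(x)) for x in grid) / m
    return truth - third_order_plug_in([math.exp] * 4, f, f_hat, grid)
```

For $T = \exp$ the Lagrange remainder is $e^{\xi}(f - \hat f)^4/24$, so the gap is positive and of fourth order in the pilot error, in line with the $O(\int (f - \hat f)^4)$ term above.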


7. Adaptation Lower bound for Estimation of $\int T(f) d\mu$.

In this section, we implement ideas similar to Efromovich et al. (1996a) to
provide a lower bound on the required price to be paid for adaptation over
$\beta < \frac{1}{4}$. Efromovich et al. (1996a) proved that, while estimating a quadratic
functional of the density, if an estimator achieves a parametric rate for $\beta > \frac{1}{4}$
then it must incur a heavier penalty than $(\log n)^{\frac{4\beta}{1+4\beta}}$ for $\beta < \frac{1}{4}$. Here we
provide a result of similar flavor regarding constraints on estimation rates
at two points $\beta_1, \beta_2 < \frac{1}{4}$.

We begin by describing the main tool in our proof, which is a general
version of the constrained risk inequality due to Cai et al. (2011), obtained as
an extension of Brown et al. (1996). For the sake of completeness, we begin with
a summary of these results. Suppose $Z$ has distribution $P_\theta$ where $\theta$ belongs
to some parameter space $\Theta$. Let $\hat Q = \hat Q(Z)$ be an estimator of a function
$Q(\theta)$ based on $Z$ with bias $B(\theta) := E_\theta(\hat Q) - Q(\theta)$. Now suppose that $\Theta_0$ and
$\Theta_1$ form a disjoint partition of $\Theta$ with priors $\pi_0$ and $\pi_1$ supported on them
respectively. Also, let $\mu_i = \int Q(\theta)\, d\pi_i$ and $\sigma_i^2 = \int (Q(\theta) - \mu_i)^2\, d\pi_i$, $i = 0, 1$, be
the mean and variance of $Q(\theta)$ under the two priors $\pi_0$ and $\pi_1$. Letting $\gamma_i$ be
the marginal density with respect to some common dominating measure of
$Z$ under $\pi_i$, $i = 0, 1$, let us denote by $E_{\gamma_0}(g(Z))$ the expectation of $g(Z)$ with
respect to the marginal density of $Z$ under prior $\pi_0$ and distinguish it from
$E_\theta(g(Z))$, which is the expectation under $P_\theta$. Lastly, denote the chi-square
divergence between $\gamma_0$ and $\gamma_1$ by $\chi = \left\{E_{\gamma_0}\left(\frac{\gamma_1}{\gamma_0} - 1\right)^2\right\}^{1/2}$. Then we have the
following result.

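Proposition 7.1 (Cai et al. (2011)). If $\int E_\theta\left(\hat Q(Z) - Q(\theta)\right)^2 d\pi_0(\theta) \le
\epsilon^2$, then

$$\left| \int B(\theta)\, d\pi_1(\theta) - \int B(\theta)\, d\pi_0(\theta) \right| \ge |\mu_1 - \mu_0| - (\epsilon + \sigma_0)\chi.$$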
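
A toy instance may help fix the notation of Proposition 7.1. The sketch below takes $Z \sim \mathrm{Bernoulli}(\theta)$, $Q(\theta) = \theta$, and point-mass priors at two values $\theta_0, \theta_1$ (so $\sigma_0 = 0$), and computes both sides of the inequality for an arbitrary estimator. The setting and the function name are illustrative assumptions and are not taken from the paper.

```python
import math

def constrained_risk_check(theta0, theta1, q_hat):
    """Return (lhs, rhs) of the constrained risk inequality for
    Z ~ Bernoulli(theta), Q(theta) = theta, point-mass priors at theta0 and
    theta1, and an estimator taking the value q_hat[z] when Z = z."""
    g0 = {0: 1 - theta0, 1: theta0}   # marginal of Z under pi_0
    g1 = {0: 1 - theta1, 1: theta1}   # marginal of Z under pi_1
    mean0 = sum(g0[z] * q_hat[z] for z in (0, 1))
    mean1 = sum(g1[z] * q_hat[z] for z in (0, 1))
    bias0, bias1 = mean0 - theta0, mean1 - theta1
    # eps^2 is the pi_0-averaged squared error risk (here a single point).
    eps = math.sqrt(sum(g0[z] * (q_hat[z] - theta0) ** 2 for z in (0, 1)))
    chi = math.sqrt(sum(g0[z] * (g1[z] / g0[z] - 1) ** 2 for z in (0, 1)))
    sigma0 = 0.0                      # point-mass prior has zero variance
    lhs = abs(bias1 - bias0)
    rhs = abs(theta1 - theta0) - (eps + sigma0) * chi
    return lhs, rhs
```

In this toy case the unbiased estimator `q_hat = (0.0, 1.0)` attains equality, which shows that the inequality cannot be improved in general.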

Since the maximum risk is always at least as large as the average risk,
this immediately yields a lower bound on the minimax risk. In order to use
this result, we will need to produce suitable priors on appropriate parameter
spaces, so that we capture the price needed for adaptation. Since the uniform
density $f_0 = \mathbb{1}_{(0,1)}$ is in all the smoothness classes we consider here, we take
$\Theta_0 = \{f_0^{(n)}\}$ and the prior $\pi_0 = \delta_{f_0^{(n)}}$, the Dirac mass at the uniform density.
Here and below $f^{(n)}$ refers to the joint density of $X_1, \ldots, X_n$ which are
i.i.d. with density $f$. Now let us construct the alternative parameter space
along the lines of Efromovich et al. (1996a), which in turn relies on Ingster
(1987). Take $h$ to be a function supported on $[0,1]$ such that $\int h = 0$,
$\int h^2 = c$ for some $c > 0$ to be specified later, $h \in H(1, C)$, and $1 + h \ge 0$.
Let $v_n$ be an increasing sequence of positive integers, to be specified later
up to multiplicative constants, and denote by $a$ the vector $(a_0, \ldots, a_{v_n - 1}) \in
\{-1, +1\}^{v_n}$. Now define

$$f_a = f_0 + \sum_{i=0}^{v_n - 1} a_i v_n^{-\beta} h(v_n x - i),$$

and let $\Theta_1 = \{f_a^{(n)} : a \in \{-1, +1\}^{v_n}\}$. First, note that by construction one
always has $f_a \in H(\beta, C)$. Finally let $\pi_1$ be the uniform prior putting
mass $2^{-v_n}$ at each point of $\Theta_1$. In order to apply Proposition 7.1, we evaluate the
following quantities. Using the notation introduced earlier,


$$\mu_0 = \int T(f_0(x))\, dx = T(1)$$

and $\sigma_0 = 0$. Also,

$$\mu_1 = \frac{1}{2^{v_n}} \sum_{a \in \{-1,+1\}^{v_n}} \int T\left(f_0(x) + \sum_{i=0}^{v_n - 1} a_i v_n^{-\beta} h(v_n x - i)\right) dx.$$


Therefore

$$|\mu_1 - \mu_0| = \left| \frac{1}{2^{v_n}} \sum_{a \in \{-1,+1\}^{v_n}} \int \left[ T\left(1 + \sum_{i=0}^{v_n - 1} a_i v_n^{-\beta} h(v_n x - i)\right) - T(1) \right] dx \right|.$$

Now by Taylor expansion, for each $a$,

$$\begin{aligned}
\int \left[ T\left(1 + \sum_{i=0}^{v_n - 1} a_i v_n^{-\beta} h(v_n x - i)\right) - T(1) \right] dx
&= T'(1) \sum_{i=0}^{v_n - 1} a_i v_n^{-\beta} \int h(v_n x - i)\, dx + T''(1) \int \left[ \sum_{i=0}^{v_n - 1} a_i v_n^{-\beta} h(v_n x - i) \right]^2 dx \\
&\quad + \int T'''(\xi(x)) \left[ \sum_{i=0}^{v_n - 1} a_i v_n^{-\beta} h(v_n x - i) \right]^3 dx \quad \text{where } \|\xi - 1\|_\infty \le 1 \\
&\ge C_T v_n^{-2\beta}
\end{aligned}$$

for some constant $C_T > 0$ depending only on $T$. The above calculations
follow by the inclusion of the support of $h$ in $[0,1]$, boundedness of $h$ and a
standard change of variables. Therefore, $|\mu_1 - \mu_0| \ge C v_n^{-2\beta}$ as well, where
we have dropped the subscript $T$ for notational convenience. Now suppose

that the bias at $f_0$ is smaller than $\epsilon_n \lesssim n^{-\frac{4\beta^*}{4\beta^*+1}}$ for some $\beta^* < \frac{1}{4}$.
Then for any $\beta < \beta^*$, we have by Proposition 7.1 that $\int B(\theta)\, d\pi_1(\theta) \ge
C v_n^{-2\beta} - \epsilon_n \chi$. Therefore, if we can show that for $v_n = \left(\frac{n^2}{c \log n}\right)^{\frac{1}{1+4\beta}}$ one has
$\chi \le C n^c$, then one gets the desired rate for the incurred penalty by choosing $c$
small enough. This can be derived following the arguments in Ingster (1987).
Therefore, we have sketched the proof of the following theorem.
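
The alternatives $f_a$ used in the lower bound can be built explicitly. The sketch below is an illustration with an assumed bump $h(u) = \texttt{amp} \cdot \sin(2\pi u)$ on $[0,1)$ (so $\int h = 0$, $\int h^2 = \texttt{amp}^2/2$ and $1 + h \ge 0$ for $\texttt{amp} \le 1$); the choice of $h$, the midpoint-rule integrator and all names are hypothetical, not the paper's.

```python
import math

def make_f_a(a, beta, amp=0.5):
    """Perturbed density f_a(x) = 1 + sum_i a_i v^(-beta) h(v x - i) on [0, 1),
    where v = len(a) and h is the assumed sine bump supported on [0, 1)."""
    v = len(a)

    def h(u):
        return amp * math.sin(2 * math.pi * u) if 0.0 <= u < 1.0 else 0.0

    def f_a(x):
        i = min(int(v * x), v - 1)   # only the i-th bump is non-zero at x
        return 1.0 + a[i] * v ** (-beta) * h(v * x - i)

    return f_a

def midpoint_integral(g, m=20000):
    """Midpoint rule on [0, 1] with m cells."""
    return sum(g((j + 0.5) / m) for j in range(m)) / m
```

For a sign vector of length $v$, `midpoint_integral(lambda x: (f_a(x) - 1) ** 2)` is close to $c\, v^{-2\beta}$ with $c = \texttt{amp}^2/2$, which is the scale that drives the second-order term in the Taylor expansion above.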

Theorem 7.2. Suppose one has 


sup E f 


2 


< 



43 * 

1+43* 


for an estimator $\hat\phi$ of $\phi(f) = \int T(f) d\mu$ with $T$ having bounded third derivative.
Then for any $\beta > \beta^*$,


sup E f 
/etfCS) 


2 


> 



8. Discussions. Our results are all provided for one dimension $d = 1$,
corresponding to a low smoothness regime $\beta < \frac{1}{4}$. Although we do not
provide explicit details, our proofs can be easily extended to incorporate the
case of $d > 1$ and to consider the $\sqrt{n}$ rate of convergence of the constructed
estimators for $\beta > \frac{1}{4}$. The case of $\beta = \frac{1}{4}$ is a little more subtle; however
this also can be done by adding one more level of discretization near the
truncation level $k = n$. Since the crux of the arguments remains the same,
we do not elaborate on such proofs here.

Following the idea of possibly modifying Lepski’s method to include a
larger class of mechanisms to choose a tuning parameter for adaptive estimation,
another potentially interesting question is to investigate which method
provides better performance, either in a finite sample sense or regarding
constants of asymptotic mean squared error. Finally, the study of adaptive
estimation of non-parametric divergences and entropies based on more than one density is
also of interest. Since the quadratic functional estimator is a U-statistic
based on a bounded kernel, the results of Houdre and Reynaud-Bouret
(2003) were readily applicable for deriving a suitable exponential tail bound.
It is worth exploring how to modify such arguments for second order U-statistic
estimators that arise in the context of non-parametric regression problems
and fail to be bounded in the case of unbounded error distributions. We
keep the study of such questions for possible future research.

APPENDIX A: PROOF OF THEOREMS 

Proof of Theorem 3.1. 


Proof. Recall that,

$$\sup_f E_f\left[(U_{n,\hat j} - \phi(f))^2\right] \le \sum_{k=0}^{1} \sup_f E_f\left[\mathbb{1}\{\hat j = k\}\,(U_{n,k} - \phi(f))^2\right] = \sup_f R_0(\beta_f) + \sup_f R_1(\beta_f).$$

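Higher Smoothness ($\beta_f = \beta_0$).
(i) First let us consider,

$$\sup_f R_0(\beta_0) = \sup_{f \in H(\beta_0)} E_f\left[\mathbb{1}(\hat j = 0)\,(U_{n,0} - \phi(f))^2\right] \le \sup_{f \in H(\beta_0)} E_f\left[(U_{n,0} - \phi(f))^2\right] \lesssim n^{-\frac{8\beta_0}{1+4\beta_0}}$$

by our choice of $U_{n,0}$.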

(ii) Next we consider $\sup_{f \in H(\beta_0)} R_1(\beta_0)$. For this part, we will bound
$R_1(\beta_0)$ from above uniformly in all $f \in H(\beta_0)$. First note that by Hölder’s
Inequality, for any conjugate pair $(p, q)$ we have

$$R_1(\beta_0) = E_f\left[\mathbb{1}(\hat j = 1)\,(U_{n,1} - \phi(f))^2\right] \le \left\{P_f(\hat j = 1)\right\}^{1/p} \left\{E_f\left[(U_{n,1} - \phi(f))^{2q}\right]\right\}^{1/q}.$$

The proof of $\sup_{f \in H(\beta_0)} R_1(\beta_0) \lesssim n^{-\frac{8\beta_0}{1+4\beta_0}}$ now follows by the following two
lemmas.

Lemma A.1. For kernels based on compactly supported wavelet bases,

sup F f (l(j = 1)) < ^ 
feH(flo) v ) n 


Lemma A.2. For kernels based on compactly supported wavelet bases, 


sup 

/£//( A,) 





Lower Smoothness (/ 3f = /3i). 
(i) First note that, 


sup R] Oi) 
/ 


sup 


E, 


Aj 


1) (U n , 1 - <K/)) 2 


< 


sup 

/6tf(/3i) 


E, 


(£4,1 ~ 


<«/)) 2 


ggl 

< n 


by our choice of U n j. 


(ii) Now,

$$R_0(\beta_1) = E_f\left(\mathbb{1}\left\{\hat I^2(k_0, k_1^*) \le C \ln(n) \frac{k_1^*}{n^2}\right\} [U_{n,0} - \phi(f)]^2\right).$$

We consider three cases as below with $\nu_n = \ln(n)^{-1/(8\beta_1+2)}$. In the calculations
below $C, C', C'', C'''$ are arbitrary constants which can be chosen for
the calculations to be valid and can change from place to place. Also, for
controlling higher order moments of related second order U-statistics below,
we use results along the lines of Lemma A.2 with obvious modifications to
the proof.

Case 1: I 2 (k 0 , k£) > C' (1 + i/ n ) In (n) ^ 

In this case we have, 


RoW) 

= E f (lS^iko,^) <Cln(n)^| 


1 <J 1 2 (ko, kl) > C' (1 + v n ) In (n) % 

n z 


[U n , 0 


< M /)] 2 
< % ((l {i (fco, K) - I(ko, k\) > (c'V n ) I (ko, A*)} [U n ,o - 0 (/)] 2 ) )

< C'"' (i/ n 7 (ko, K))- 2 E f ((/ (fc 0 , fcj) - 7 (ko, kl )) 2 [£4 i0 - 0 (/)] 2 ) 

< C'" (u n I (ko, k\))- 2 K 1 / 2 ((/ (k 0 , k{) -I(ko, /cD) 4 ) Ef ([£/ n , 0 - </> (/)] 4 ) 

< C'"' (v n I (k 0 , kl))~ 2 (n~ 2 kl) (j (k 0 , kl) + (kl)- 2lh ^j < C'" (v n )~ 2 (n~ 2 kl) 

/ 2 \ l/(l+4/3i) 8/3 

= C'" In(n) 1 /^^) ( ) n - 2 = c/V^+^n - 2 = C/'V TO 

\ ln (n)y 

Case 2: I 2 (ko,kJ) < C£+ 

In this case, 

E/ ([E/ n , 0 - 0(/)] 2 ) = A: 0 - 4/31 + ^ x I 2 (koM) + ^ < (/n'WI. 

Case 3: I 2 (k 0 , k{) < C (1 + i/„) In (n) and I 2 (k 0 , k\) < C^ 

In this case, 1 2 (£'o, kl) < (1 + u n ) In (n) and 1 2 (fco, k \) > %, so that 
E/ ([£7n,o - ^ (/)] 2 ) < C 1 ' (V (ko, kl) + (kl)~ 4Pl + C" (i 2 (ko, kl) + (kl )~ 4/31 + 

< d" In ( n ) 1 - 1 /( 1 + 4 ^ 1 ) n -8ft/(l+4^) = C "' n - 8/3 1 /(l+4/3 1 ) ln (n) 4^/(l+4^) 

□ 


Proof of Theorem 4.1. 


Proof. Suppose /3 € (/3 J+ i, fi 3 ]. Let l c (C*) be the largest l such that 
I 2 (kl,kl) > C* log nR(kl). Let us explain the reason for existence of such 

I 2 (k*,k*) 

an l c (C*). Note that it is enough to show that the ratio c . Iog nR(k*) i s U P“ 
per bounded by a quantity that l increases through positive integers. By 
our choice of /3 € (fij+i, /3y ], we have that g* log nR^*) mos t a constant 


times 


(k*)~ 4p o +1 


k* 

J 

TV 


The power of n in this ration is 1+4 ^. 


8 /%+i 

1+4/3; 


which indeed 


decreases as l increases. Also, note that trivially l c (C*) < j. Also, by defini¬ 
tion of l c (C*), for any l < l c , I 2 (kl,kj) > C* log nR(kj) and for l > l c (C*), 
I 2 (kl,k*) < C* log nR(kj). Our proof relies on the fact that the testing pro¬ 
cedure does not select an index l < l c (C*) or l > j +1 with high probability. 
This is captured by the following two lemmas. 


Lemma A.3. For kernels based on compactly supported wavelet bases,

C'l/JQ 


sup P f[j = l) < 


n 


for any l < l c (C*) whenever C* > 4 C% pt . 

Lemma A.4. For kernels based on compactly supported wavelet bases, 

C^ 0 

5 

n 


sup P f [j >j + 2 ) < 
feHQ3) 


whenever C* > 4 C^ pt . 

Let us first complete the proof of Theorem 4.1 assuming the validity of
Lemmas A.3 and A.4. Note that,

$$E_f\left(U_{n,k^*_{\hat j}} - \phi(f)\right)^2 = \sum_{l=s^*}^{N-1} E_f\left[\mathbb{1}(\hat j = l)\left(U_{n,k_l^*} - \phi(f)\right)^2\right] = T_1 + T_2 + T_3, \tag{A.1}$$

where

$$T_1 = \sum_{l=s^*}^{l_c(C^*)} E_f\left[\mathbb{1}(\hat j = l)\left(U_{n,k_l^*} - \phi(f)\right)^2\right], \quad
T_2 = \sum_{l=l_c(C^*)+1}^{j+1} E_f\left[\mathbb{1}(\hat j = l)\left(U_{n,k_l^*} - \phi(f)\right)^2\right],$$

$$\text{and} \quad T_3 = \sum_{l=j+2}^{N-1} E_f\left[\mathbb{1}(\hat j = l)\left(U_{n,k_l^*} - \phi(f)\right)^2\right].$$

In the following we control
the terms $T_h$, $h = 1, 2, 3$ individually to show the desired result modulo the
proofs of the above two lemmas. We then finish the proof by proving the
lemmas. For the following let $(p, q)$ denote a pair of positive real numbers
such that $\frac{1}{p} + \frac{1}{q} = 1$. The specific choices of the pair will be clear from the
proof. In particular, we will always choose $q$ to be an integer and $p$ will be
sufficiently close to 1.

Control of T\. 

lc(C*) 

Ti = E E / 

l=S* 

lc(C *) , 


l(j = l) [Unfit - 


< £ p; (3 = 0 E 

l=s* 
lc(C*) 

£ E 


U n .k* 


l=s * 


c, 


V’o 


n 


k 


n* 


+ (*?) 


* \ —2/3 


2 g 


< log n 


a 


V>0 


n 
where the second last inequality above follows from Lemma A.3 and the conditions
of the theorem, and the last inequality follows from the choice of
$N \lesssim \log n$. Therefore, for $p$ sufficiently close to 1, we have the desired control
over $T_1$.


j+1 

t 2 = y E / 

l=lc(C *)+1 


Control of T 2 . 

Aj = i) (u n , k * - 


j+1 - r \ ± 

< y p /($ = 0 E / 

Z=L(C*)+1 


J+1 

< c, E p 

J=L(C*)+1 




E 


7 


U n<k ? — Ef ( u, 


+ E / 


< 


E / E / 

c» E i*|(i = '){S + - f '( t i+?) + § + (t,T 2 ' 5 } 

i=L(C*)+l L J 


/ l u n,k* 
2? 


U n ,kT ~ 


2 q 


2 q 


+ E / 


U n ,kj 


l=lc(C*) +1 
J+ 1 


k 


n 


<C, E p H j = 'L5 +/2( T* ! ‘' ) + 

z=L(C*)+i L 

i +1 i / N 

< C g C* log nR(k*) Y F f (j = l ) ~ C i C * l ogn 1+e R(k*)- 

l=lc(C*) +1 

Above, the fourth last inequality follows from condition 1 of the theorem, 
the third last inequality follows condition 2 of the theorem and using the fact 


2 A~ 2 A+1 


< 


that f3j + \ < (3 < f3j (which implies that ( k *) 2/3 < ( k *) 2/3 ? 1+4,03 

k* C k* 

C^n l °s n < C'-?). The second last inequality follows from our choice of 

J+ 1 \ 

l c (C*). The last inequality follows since P/ (j = /) < 1, we have 

i=i c (C *)+1 v 2 

J+ 1 I /, \ 

that for p sufficiently close to 1, one also must have that P? (j = l) < 

l=l c (C *)+1 V 2 

8/3 4 p 

log rf. Finally noting that CqC* log nR(k*) < n 1 ++ 3 logn 1 + 4 ' 3 completes 
the required control T 2 . 


Control of T3. 

iV-1 

r 3 = E E / 

l=j +2 


X(j = 0 U n> kf - 


2g 
<


JV - 1 I / x 1 

E p / (j = 0 E / 


2 




TV-1 


S E 

1=3+ 2 


n 


p 




+(*?)- 2,s 


< logn 



i 

V 


where the second last inequality above follows from Lemma A.4 and the conditions
of the theorem, and the last inequality follows from the choice of
$N \lesssim \log n$. Therefore, for $p$ sufficiently close to 1, we have the desired control
over $T_3$. □


Proof of Theorem 5.1. 


Proof. The proof borrows ideas from Robins et al. (2015). However, the
computations are much more cumbersome since we now control moments of
a third order U-statistic.

We decompose U n ,\ as, 

1pki,k2,k3 — bri,l T Rn ,li 


E, 


where 

V n ,l = 

and 


1 f ~ Mn 

7 777 77 ^ y S K(XiiXj,X s )I((Xi,Xj,X s ') € (I Xn,m X Xn,m X Xn,m ) 

n(n — 1 )(n — 2 

i^j^s K m= 1 


Rn,l — Pk\,k2,k3 bn, !• 


The sets Xn^m^m = 1,..., M n are constructed according to Robins et al. 
(2015). Therefore, for Haar basis in one dimension, we can take M n = n and 
Xn,m = ^) for m = 1,..., M n . Therefore, 


\ 2<j‘ 

_ r r 

J 

< C g [E f [ 


\2q 


+ E 


Below, we only control the first summand, i.e. $E_f\left(V_{n,1} - E_f(V_{n,1})\right)^{2q}$, which
suffices for the Haar basis. However, for non-Haar bases this is not sufficient. We
believe that for non-Haar bases one can show that the second summand has a
sub-optimal rate.


















28 


Control of E f [(V„,i - E f (V^)) 211 

Write Vn.i = J2m= 1 v n,m with 
1 


Vn..m. — 


n(n — 1 )(n — 2) 


£ K(X i ,X„X,)I((X l ,X j ,X,) 6 Xn,m*Xn ,m X X.n ,m) • 




Following Robins et al. (2015), we define the following membership based 
quantities. Let, 

I n ,r ■= m if X r € Xn,m, V = 1, . . . , 71 , 771 = 1, ... , M n , 

(j" ' In,r — 771 ) . 

Then, (N n ^ m , 1 <m< M n ) ~ Multinomial (p n m, 1 < m < M„), where 


Pn,m — 


f f (x)dx. 

J X n .. m. 


Note that p n ,m € [Ijaia, where f m ; n = rriin f(x). Given the vector 

I n = (/ n ,i) ■ ■ • , In,M n )-> the observations X\,..., X n are independent with 
distribution of X r \I n , r = m being Ix ™'™ dt , Now, 


Pn, i 


E, 


(y^i-E/^r)) 29 




E, 


(Kp-E/^ilT ,)) 29 


Control of Ef 

Now, note that, 


+E/ [(E f (V nA \l n )-E f (V n!l )) 2<1 
(Ef(V n ,i|I n )-E f (V n ,i)) 2q 




MVnAlr.) ~ MVn,A = £ OW,.™ 2 > _ j , ^ 


m=l 


n ( n — l)( n — 2)Pn,r. 


where a n ^ m = f f K(x 1 ,x 2 ,x 3 )l((x 1 ,x 2 ,x 3 ) € Xn,m^Xn,m^Xn,m)d{F(x 1 ))d(F{x 2 ))d{F{x 3 )). 
The next lemma provides control of the terms OL n ,m- 

Lemma A. 5. For compactly supported wavelet bases one has 

, | . o, 

max \a n m \ < -— 

m M, 
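
The centering by 1 in the display preceding Lemma A.5 reflects a factorial-moment identity: each bin count $N_{n,m}$ is marginally Binomial$(n, p_{n,m})$, so $E[N_{n,m}(N_{n,m}-1)(N_{n,m}-2)] = n(n-1)(n-2)\,p_{n,m}^3$. A short numerical check of this identity, with a hypothetical function name, is sketched below.

```python
import math

def binomial_third_factorial_moment(n, p):
    """Exact E[N (N - 1) (N - 2)] for N ~ Binomial(n, p), summing over the pmf."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) * k * (k - 1) * (k - 2)
               for k in range(n + 1))

def centered_ratio_mean(n, p):
    """Mean of N(N-1)(N-2) / (n(n-1)(n-2) p^3) - 1, which should vanish."""
    return binomial_third_factorial_moment(n, p) / (n * (n - 1) * (n - 2) * p ** 3) - 1.0
```

Because each centered ratio has mean zero, the difference between the conditional and unconditional means is a weighted sum of mean-zero functions of the bin counts, which is the form to which the moment inequalities are then applied.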
Therefore, we have $\sum_{m=1}^{M_n} |\alpha_{n,m}| \le C_{\psi_0}$. Now, since the $N_{n,m}$'s are negatively
associated (see definition in Joag-Dev and Proschan (1983)), we have by
Theorem 2 of Shao (2000) followed by the Marcinkiewicz-Zygmund inequality
that,


Ef 

= E f 


(E f (V n:1 \l n ) -E f (V nA )) 2q 


M n 


f JVn : m(N n , m - l)(iV n , m - 2) \ 

n(n-l)(n-2>3 m 


2 q 


M 2q 

n 


_ 2 q l 


Mr, 


< C q Ml«M n J w Y, E 


f 


M n 


m= 1 
1 2 q 


N m (N m - l)(JV n , m - 2) _ ^ 29 
n(n- l)(n-2)p^ m 


a: 


2 q 

n,m 


< C* q M q - 1 £ K, m | 2(? (A.2) 

m= 1 

M n 

< C*M q - 1 {m&x(a n , m )} 2q ~ 1 ^ |a n ,m| < C'^ o M 9 _ 1 {max(Q ; ri!m )} 29_1 


m= 1 


In the above display, equation A.2 follows from Lemma 5.6 of Robins et al. 
(2015). Now, by Lemma A.5, {max(a nim )} 2,?_1 < {C{ip 0 , ||/||oo)) 29_1 


(Ef(V nA \I n )-E f (V n , 1 )) 2q < (C^H/IU) 29 " 1 ^ 


< (CVJ/Hoo) 29 


(V n ,i-E f (V„,i|I n )) 2q 


Therefore, Ef 
for our choice of M n x n. 

Control of Ef 

Suppose, 14,m = 14 ( ,m + Vn% + (K,m|In) denote the Hoeffding de¬ 

composition of V n , m w.r.t the conditional distribution given I n . Therefore, 

M„ \ 2q \ 


-1 ( k3 


E 


{Vn,l-Ef{V n , l \l n )) 2q 


= E 


/ 


V (r( 1) +p(2) +y(3)\ 

/ v | r n,m 1 r n,m ' v n,m j 


\m=l 


Note that $E_f(V^{(1)}_{n,m} | I_n) = 0$, $E_f(V^{(2)}_{n,m} | I_n) = 0$, $E_f(V^{(3)}_{n,m} | I_n) = 0$ and they
are independent over $m$ conditional on $I_n$. Hence, by a conditional version
of Rosenthal’s Inequality (noting that the constants of the inequality do
not depend on the underlying distribution), we have
Mn

r M n 

2q- 

\ 2 

< CqE/ 

E E / (|h;W + + <i| 29 |I„) + 

£% 

(l^«+^ + ^| 2 In) 


m=l 

v. m=l 

J 



Mn 

$ 

s 

Hi? 

_ 1 

c q 

E e / (ik% + ^ + <ii 29 ) + % 

E % (l<m + + ^rlPlIn) 


m=l 

l m=l J 


(A.3) 


Below we control 

E E / (| V n% + V f% + V $n\ 2q ) and E f 


separately, at the required rate. 



Control of £^1 E f (|vgL + vgL + vg„| 2q ) 



We now provide a general control over terms like E f 

( lyW 4 . t/( 2 ) 

l | vn,m 1 vn,m 

4 . y( 3 ) \2q\ 

To this end, note that, it is enough to control E f (j 

T n ( ^) 29 ) ) E / ( 

'(V®*) 29 ), 


and E f ((K ( 3 rl) 29 ) separately. We have dropped the absolute value sign as¬ 
suming without loss of generality that q is a sufficiently large integer. 
Control of E f ((V$„) 2q ) . 



Note that, by Hoeffding decomposition of U-statistics with asymmetric 
kernel, we have 


tHI) _ N nj m{N n ,m ~ 1 ~ 2) 

n ’ m n(n — l)(n — 2 ) 


N, 


Nn,m 

£ 


n.m ■ 

' n — 


i =1 


^fXj Jx s 


E j 

fXjJx s 
~^^fXj ifX s 

+Vf X jifx s 

^fXjJXs 

~^^fx,ifx s 


X {X^ Xj , Xg') Xj , Xg'j G Xn,m X Xn,m X Xn,ra)|In 

X X s , Xj) Xj , -Xs) G Xn,m X Xn,m X Xn,ra)|Iri 

(.Xj , , -^s) I{(Xi , Xj , ) G Xn,m X Xn,m X Xn,m)|Iri 

5 -X7 ) -^((Xi ? Xj , -X s ) G Xn,m ^ Xn,m X Xn,m)|Iri 
(Xj^Xg^Xi)I((Xi^Xj^Xg) ^ Xn,m ^ Xn,m X Xn,ra)|Iri 
(Xg, Xj, X{) Xj, X s ) G Xn,m ^ Xn,m X Xn,ra)|I?i 


6Ey -R" (-^"2? Xj , -Xs) fL((Xi, Xj , -X$) G Xn,m X Xn,m X Xn,m)|Iri 
By an application of a conditional version of the Marcinkiewicz-Zygmund Inequality
(noting that the constants of the inequality do not depend on the
underlying distribution) we have that


E 


((KS) 29 |Ir 


2q 


r N% m (N ntJn - 1 ) 2 g(JV n , m - 2) 2q N n ^ 

— i n 2q (n — l) 2q (n — 2) 2q N n)Tn 


Nn,m 

E E f 

i =1 


V 


■■‘fXjJXs 


E, 

Jx s 

~^^fXj Jx s 
~^^fXj ifX s 
~^^fXj ifX s 
~^^fXj ifx s 

— 6E/ 


K (Xi , Xj , X s ) X( {X ^, Xj , X s ) G Xn,m X Xn,m X Xn,m)|In 
K {Xi , -AT S , -Xjf) X(5 -X7 ? X s ) G Xn,m X Xn,m X Xn,ra)H 

K {Xj , X^, -Xs) Z{{Xi, -Xj, -X s ) G Xn,m X Xn,m X Xn,m)|Ii 

X (-X s , X^ 5 Xj ) I{{Xi) Xj , .X s ) G Xn,m ^ Xn,m X Xn,m)|Ii 

X {Xj' > X s ,Xi)'X{{Xi' ) Xj' ) X s ) G Xn,m X Xn,m X Xn,m)|Ii 

X {X s , Xj, Xi)'I'{{Xi, Xj, X^ G Xn,m ^ Xn,m X Xn,m)|Ii 

-X {Xi) Xj, X s ) I{{Xi, Xj) X s ) G Xn,m X Xn,m X Xn,m)|I?i 


2 <? 


< c N% m (N n , m ~ l) 29 (iVn,m ~ 2) 2 ? jV n ,f 
— 9 n 2q (n — l) 2 *? (rz, — 2) 2< ? JV nj 

( 


x 


JV„ 


E 

2=1 


^"^E/x^,/x s X (Xi, Xj,X s ') 1 ((Xi,Xj,X s ') G Xn.m X Xn.m X Xn.m ) |In ^ |In) 
"bEy ^ ^ E/x.y ■fx s X (X^. X s . Xj') I((Xi, Xj. X s ) G Xn.m X Xn,m X Xn,m)|In J' |In) 

/ r r~ , . ., . 1 -i 29 . \ 


+E/({l 


E 

+E/ ( |e 
+E/(|e 


/x,,- ,/x s 


/Xj ,/x s 


X (Xj. Xi. Xg)1((Xi. Xj. X s ) G Xn.m X Xn,m X Xn.m,) |I 

X (X s . Xi. Xj') I((Xi. Xj. X s ) G Xn,m X Xn.m X Xn,m)|I 

X (Xj. X s . Xi) Z((Xi. Xj. X s ) G Xn.m X Xn.m X Xn.m) |I 

~^E/ ( 'j E/x a . ,/x s X (X s . Xj. Xi) I((Xi, Xj. X s ) G Xn.m X Xn.m X Xn.m) |I 

(A.4) 


/Xj >/x s 


2 g 


2 g 


2 ? 


1 


4g+l 

Pn.m 


Above, the last inequality follows from conditional Jensen’s Inequality. A 
typical term in the above summand looks like 


’ Xn,m 


' Xn.m J Xn.m • 


EEE 

h ^2 b 


^ ( X )^ 1 1 (^l)^f 2 2 (®M * 2 M 

(* W (*3 ) / (®2 ) / (®3 ) 


2 <? 


dxdx2dx 3 


f(xi)dxi 
(A.5)


for some tuple k\ , k- 2 , hi- To evaluate each of the above, we consider three 
different ordering of k\ , k^ ■ k%. Let us consider the the integrals in the square 
bracket first. Since here x\ is fixed, we look for the subinterval of resolution 
k\ where x\ lies. In each of the cases we look at the specific subinterval 
containing x\ for every fixed x±. Further this subinterval needs to intersect 
Xn,m • Say, we index that by k\. Call this subinterval S^x i). Taking the in¬ 
tegral inside the sum over location parameters, we have that each summand 
is bounded by C{$ 0 , ]|/]|oo) • For each hxed x \, the number 

of summands contributing will be bounded by C^ 0 max (CL2,fc3) _ Therefore, 
whichever interval x must lie it should intersect that containing Xi. There¬ 
fore, the term in the square bracket is bounded by C^ 0 for each fixed x±. 
Raising to the power of 2 q and integrating with respect to / over Xn,m yields 
the outer integral to be bounded by C(V’ o, ll/lloo )p n ,m- Therefore, (A.5) is 
always bounded by . Therefore, by (A.4) we have that almost 

Pn,m 

surely, 


E/ 



< C q C^ o, 


N 2q 


.2£ 


(Nn,m-l) 2q (N n , m -2) 2q N n ^ N 7 


n 2<j( n _ f)2(j( n _ 2) 2 9 _/\T n 


X 


2 q 

Pn,m 


(A.6) 


Now, noting that p n: m > and using Lemma 5.6 of Robins et al. (2015), 
we have that 


EfE f ((R n %) 29 |In) < Cq(l|V;olloo J g /lloo ’ /min) , (A.7) 

which is at the desired rate of control. 

Control of $\mathbb{E}_f\big((V_{n,m}^{(2)})^{2q}\big)$.

For the second term of the U-statistic it is enough to control the
expectation of


Pm) _ (A J n,m{N ntm — 1 )(N njm - 1)) 
1 21 ~ ' 


2 q 


n 2q (n — 1 ) 2q (n — 2) 2q 


N n 


~ fix ■) fix ) 

(E/ x . {K(Xi, Xj, X s )T({Xi, Xj, X s ) € X n,m X X n,m X Xn,m)|In ^ dxjdx 


' \n,m \n,m 


Pn,m 


and 
j(m) _ {Nn,m{N n ,m ~ t( N n,m ~ 1)) 

22 n 2c i(n — l) 2q (n — 2) 2q 


2 q 


1 


N, 


~ fix ■) fix 1 

(E fx XK{X i ,X j ,X a )I{{X i ,X j ,X a ) € Xn,m X Xn,m X Xn,m)|I»)) 2g 2 


71,771 Xn,m ** Xn,m 


p2 

Jrn.m 


Control of $I_{21}^{(m)}$.

First note that one can argue by simply looking at orders of truncation that, 


E fXi {K(Xi,Xj,X a )I((X, 

_ L 'L 

Pn ,m + Z^ 2 Xz 3 f x 


Z 5 Xj , -Xg ) G X77, ,m X Xn X Xn 


/ (aO^f (ah)V£? {xMh ( X W 3 3 (®a)/(*i)rf®rf®i 

i / V’fi 3 ( x )V’f 1 3 OjM^ 3 O)^ 3 (x s )f(x 1 )dxdx 


Let us control the first term in the square bracket first. For each fixed value
of $(x_j, x_s)$, find the boxes of length $1/k_3$ which contain $x_j$ and $x_s$. If these
boxes are disjoint then they do not contribute to the summand of the first term.
Therefore, the only way of getting a contribution in the first term is when $x_j$
and $x_s$ are in the same box of resolution $k_3$. Therefore, $x$ must also belong to
the same box of resolution $k_3$ to contribute to the summand. Therefore, the
first term inside the square bracket is always bounded by C(ii> 0 , ||/||oo) klu n 
for every choice (xj, x s ) lying in the same L 3 resolution box and 0 other¬ 
wise. Therefore, the integral of the square of the first term over (xj, x s ) € 

Xn,m X Xn,m is bounded by C(i/) 0 , ||/||oo) [St] T^jt = ll/l|oo)^f • 

Now, let us look at the second term of the square bracket. Once again fix
$(x_j, x_s) \in \mathcal{X}_{n,m} \times \mathcal{X}_{n,m}$ and find the $k_3$ resolution box that $x_s$ belongs to.
Now note that $x_i, x$ must belong to the same $k_3$ resolution box and hence
the number of summands contributing to the sum in the second term of the
square bracket is again bounded by $C_{\psi_0}$. Therefore, the second term inside

the square bracket is always bounded by C{ij) 0 , ||/||oo)-T 3 r- for every choice 

k 3 

$(x_j,x_s)$ with $x_s$ lying in some $k_3$ resolution box and $0$ otherwise. Therefore,
the integral of the square of the second term over $(x_j,x_s) \in \mathcal{X}_{n,m} \times \mathcal{X}_{n,m}$

is bounded by C{iJj 0 , \\f\\oo) (^f) jfk ufc = Since > b y 

our choice, k% M n , the first term of the square bracket dominates af¬ 

ter squaring and integrating over over (xj. x s ) € Xn,m x Xn,m ■ Taking into 
account the division by Pn )Tn ’s, the final contribution to is bounded 
(c(V’O) Il/||oo)^r) 1 ~k~- Therefore, 

4? < C'dl^ollocH/lloc/min) (N n , m {N n , m - l)(N n , m - l))* -L 

< C(\\Moo, ll/lloo, /min) (N n , m (N n , m - 1 )(N n , m ~ 1 )f q 

(A.8) 

Taking expectations by using Lemma 5.6 of Robins et al. (2015) yields the 
desired control. 

Control of $I_{22}^{(m)}$.

The calculation for this term is similar to that for $I_{21}^{(m)}$. The only
difference is that instead of taking the square of the square bracket term we take
the $2q$-th power. Hence, the integral over $(x_j,x_s) \in \mathcal{X}_{n,m} \times \mathcal{X}_{n,m}$ is bounded

Taking into account the division by p njm ’s, the final contribution to i s 
bounded 


M| \ 2q J__fe3_ / zq kz_l _1 


+ 


k 3 M n ) k'n M n V k\ ) M n k 3 M, 


2 q 


Wo, ll/lloo) 

Therefore, 

4? < C'dlV’olloo, ll/lloo, /min) (N n>m (N n , m - 1 )(N n>m - 2 )f q 


2q+2 ' 
Pn,m 


k\k 3 \ 2<? 1 k 3 / fcifcg k 3 1 1 

k 3 M n ) iqw n + \jff) W n Y 3 W n 


2 q 


< Cdh/’clloo, ll/lloo, /min) (N n , m (N n , m - l)(N n>m ~ 2)f q ^ 


2q—l 


(A.9) 


Taking expectations by using Lemma 5.6 of Robins et al. (2015) yields the 
desired control. 

Combining controls of $I_{21}^{(m)}$ and $I_{22}^{(m)}$ for control of $\mathbb{E}_f\big((V_{n,m}^{(2)})^{2q}\big)$.


We get 

Er ((vj^)**) < CdlVolU, ll/lloo./. 




n- 


1 

n 4ij-2 
Control of $\mathbb{E}_f\big((V_{n,m}^{(3)})^{2q}\big)$.

For this part we will need a moment bound for third order U-statistics. We 
do that in the following Lemma. 

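Lemma A.6. For any $q > 2$, there exists a constant $C_q$ such that for any
i.i.d. random variables $X_1, X_2, \ldots, X_m$ and degenerate symmetric kernel $K$,

$$
\mathbb{E}\left|\frac{1}{m(m-1)(m-2)}\sum_{i_1 \neq i_2 \neq i_3} K(X_{i_1},X_{i_2},X_{i_3})\right|^{q}
\le C_q\, m^{-q}\, \mathbb{E}\left|K(X_{i_1},X_{i_2},X_{i_3})\right|^{q}.
$$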
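To make the normalisation in Lemma A.6 concrete, the following is a minimal Python sketch of the averaged third-order U-statistic over ordered triples of distinct indices. The helper name `third_order_u_statistic` and the toy product kernel are illustrative assumptions and do not come from the paper; the toy kernel is not degenerate and is used only to check the arithmetic of the averaging.

```python
from itertools import permutations


def third_order_u_statistic(sample, kernel):
    """Average kernel(x_i1, x_i2, x_i3) over all ordered triples of distinct indices.

    Hypothetical helper for illustration; costs O(m^3) kernel evaluations.
    """
    m = len(sample)
    if m < 3:
        raise ValueError("need at least three observations")
    total = 0.0
    # permutations(range(m), 3) enumerates ordered triples with all indices distinct
    for i1, i2, i3 in permutations(range(m), 3):
        total += kernel(sample[i1], sample[i2], sample[i3])
    return total / (m * (m - 1) * (m - 2))


# Toy check with the (non-degenerate) product kernel on three points.
value = third_order_u_statistic([1.0, 2.0, 3.0], lambda x, y, z: x * y * z)
```

The brute-force enumeration is only meant to mirror the definition; for the sample sizes in the paper one would exploit the structure of the kernel instead.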

We will now apply Lemma A.6 to control moments of the degenerate part
of the kernel $K(X_i,X_j,X_s) = \int K_{k_1}(x,X_i)K_{k_3}(x,X_j)K_{k_3}(x,X_s)\,I\big((X_i,X_j,X_s) \in \mathcal{X}_{n,m} \times \mathcal{X}_{n,m} \times \mathcal{X}_{n,m}\big)\,dx$.
Standard arguments show that this is the term with
the highest contribution among all the different terms of $V_{n,m}^{(3)}$. Denoting the
degenerate part of this kernel by we can show by Lemma A.6 and a 

standard contraction of norm under conditional expectation argument that 


% ((O 29 1 r '0 

Ctyo, H/lloo) 9 (iV n , m (iV nim - l)(JV n , m - 2)) 2q 


< 


n 


2q {n - l) 2q {n - 2) 2q 


E 


K% n) (X 1 ,X 2 ,X 3 ) 


2 g 


< 


x E 


N 2q m 

Ctyo, ll/lloo) 9 (JVn,m(iVn,m ~ 1 ){N n , m ~ 2)f q 

n 2q (n — 1 ) 2q {n — 2) 2q 

E | f Khi (pc, X\)K^ (pc, (pC) Xj , Xs) ^ Xn,m X Xn,m X | 

N% m 

Cjipo, ll/lloo) 9 ((N n , m - l)(iW, m - 2)) 2q 
n 2 i(n — 1 ) 2q (n — 2') 2q 

/ {x,Xi)K k3 (x, X 2 )K k3 (x, X 5 )I((Xi,Xj,X s ) € Xn X Xn x Xn,m)dx 

(A.ll) 

Therefore it is enough to control 


2 q 


E 


I K k ^,X 1 )K t ,(x,X 2 )K k 3 ( 1 : ,X 3 )iax l ,X j ,X,) e Xn, m X ,ra x Xn,m)dx 


2 q 
at a level so that the expectation of the right hand side of (A.11) is controlled
at the desired level of ■ We do this below. 

Control of E | J Kki (X) -X^l)Ki C 3 (x, X2)Kk 3 (x ; X3)I((Xj, Xj, X s ) € Xn.m X \n.in x Xn,m)dx| 


J Afci (x, Ai) Kk 3 (x, X2)Kfc 3 (x, Xf)I(fXii Xj, -X s ) € Xn,m X Xn,m X Xn,m)dx 

i/ / f [see/ 


2q 


' Xn,m Xn,m J Xn,m ^ ^ ^ 


< 3 (Z2)C(Z)< 3 (*3) 


f(xi)f(x 2 )f(x 3 ) 


dx\dx2dxs 


Now, for each fixed value of $(x_1, x_2, x_3)$, find the boxes of length $1/k_3$ which
contain $x_2$ and $x_3$. Therefore, the only way of getting a contribution is when $x_2$
and $x_3$ are in the same box of resolution $k_3$. Therefore, $x$ must also belong
to the same box of resolution k$ to contribute to the summand. There- 

/ fcifc 2 \ 2g 

fore, the term inside the square bracket is always bounded by ( ) 


for every choice $(x_2,x_3)$ lying in the same $k_3$ resolution box and $0$ otherwise.
Therefore, the final integral with respect to $(x_1,x_2,x_3)$ is bounded by


Cty 0 , 


) 9L ^- 

Jrn.m 


1 1 fc.S _ 

' M n fef n ~ 


f ■ )k 2q ~ 1 n 2g+1 

)5 t /min y /^3 


Combining with (A.11), we have that almost
surely,



< 


Wo, ll/Hoo) 9 ((N n ,m ~ 1 )(N n>m ~ 2)) 2q 


x E 


n 2g {n — 1 ) 2< ?(n — 2) 2g 

J K^XJK^XJK,, 6 x„, m x x „ x Xn,m)d2 


2q 


< C'gdIV’olloo, 


u/n 


((^n,m — l)(A r n|m — 2)) 2g 2g_! 2g+l 

n 2g (n — l) 2g (n — 2) 2g 3 W 




, 2g-l 

3 ,/min)((iV n , m -l)(lV rl , m -2)) 29 3 


n 


4q—l 


(A. 12 ) 
Therefore, by Lemma 5.6 of Robins et al. (2015) we have that, 


E, 


'%)*) < c q 


005 || J 11 OO 5 


k 2q ~ l 

f . 'iLs_- r 

' n 4q-l ~ W 


OO) IIJ 11 OO 5 


i /min) 


hV-- 1 

n 2 J 
(A.13) 


Combining (A.7), (A.10) and (A.13) and summing over M n = 0(n ) par¬ 
titions we get the desired bound that 


E % (VS + V$l + V£l\ 2q ) < C q 


005 \\J ||00 


55 /min) 


m= 1 


n 2 ) 


2q 


Control of E f {Em=i E f (|vgL + v£L + vgL| 2 |I„) } 2 . 

Unlike the previous case, here we will employ a conditional version of the
Giné-Latała inequality for moments of U-statistics, along with the observation
that the constants of the inequality do not depend on the underlying
distribution. The only difference will be that we will have to do an extra
step of taking the expectation of the moment of a sum of functions of multinomial
coordinate statistics. To deal with this we again use the Marcinkiewicz-Zygmund
inequality following a use of Theorem 2 of Shao (2000). First,

2q 

note that it is enough to control E f |X^m=i E/ ((Vn,m) 2 |In) j 2 , 

% {£m=i E / ((Un (2 rl) 2 |In) } \ E / {^ 1 % ((K ( & 2 |In) separately at 
the desired rate. 


Control of $\mathbb{E}_f\big\{\sum_{m=1}^{M_n}\mathbb{E}_f\big((V_{n,m}^{(1)})^{2} \mid I_n\big)\big\}^{2q/2}$.

Note that by (A.6),


E/ ((v;%) 29 |ln) < Ctyo, ll/lloo) 


<c q 


Q Nlm(N n , m -l) 2q (N n:m -2) 2q 
pn,mn 2q (n — 1 ) 2q {n — 2) 2q 

N q ,m(N njm -l) 2q (N n , m ~2) 2q 


005 II J 11 OO 5 


i /min) 1 


Therefore, 

E/ ((K%) 2 |In) < C 2 


OO) II J ||oo) /min 


(n — 1 ) 2c i(n — 2 ) 2 Q 

iVn,m(iVn,m ~ 1 ) 2 (A~n,m ~ 2) 2 
(n — l) 2 (n — 2) 2 


Therefore, by Theorem 2 of Shao (2000), 


( Mn 

E f \ E / 

V m= 1 


2 q 
2 


n,m/ \ n 


K, m (iv', m -i) 2 (iv; m -2) 2 
< c 2 E/ < 2 ^- 

f m= 1 


2q 

2 


Ti 
where $N'_{n,m}$, $m = 1,\ldots,M_n$, are independent with the same marginals as
$N_{n,m}$, $m = 1,\ldots,M_n$. Now,


E f g K,m(K ,m - 1 ) 2 « m - 2) 

lm=l 


2q 

2 'l 2 


77. 


<c 9 


E 


r *' / K.m(K.m-lr« ro -2) 2N ' 9 


+ 




-)} 




E / E 


(A-14) 

The first term in the above summand can be controlled by Marcinkiewicz 
Zygmund Inequality as follows, 

' N' (N' -l) 2 (N' - 2) 2 (N' (N' -1 ) 2 <N' -2) 2 \\] q 

iy n,m\ iy n,m ) \ ±y n,m ^) jg / ± v n^mx 2 - Y n,m ) ) \ \ ( 


< m= l 


rr 


j 


n^ 


Mn 


< c,m«mJ±- ]T % 


N' IN' -1 ) 2 (N' — 2) 2 (N' IN' -1 Y(N' -2) 

n,m\ n,m *-> V n,»n 7 E | A 'n,m\- L 'n,m -*-/ V 'n,m ) 


m =1 


Mn 


rr 

2 (ATI 


n ’ 


< c q M .i y E . 


N' (N f -1 V(N' -2) 

■ L 'n,m\- L 'n 1 m / \ ±y n,m ^J 


\ 2 \ 1 9 


ra=l 


77E 


Therefore, it is enough to control Ej | 


AT' nv' —l) 2 (iV' —2) 2 \ i ^ 

n.m \ n,m J \ n,m / \ I 


)} 


(A.15) 


for this 


part of the proof. But, by Lemma 5.6 of Robins et al. (2015), we have that
there exists a constant $C_q(\|f\|_\infty, \psi_0)$ such that

( f N' (N' -1) 2 (N' — 2) 2N \ ^ q 


E 


/ 


By a similar argument, 


< 


CqiWfWoo,^) 


n ’ 


n 




E, 


'K,m( N n,m-±)HK,m-2) 2 \ < C q (\\f ||oo,^o) 


rr 


rr 


(A.16) 


(A.17) 


Combining (A.14) with (A.15), (A.16) and (A.17), we have, 


2 q 


E / | g K,m(K ,m - - 2) 2 \ 2 

lm=l 


n’ 




f-1^ C' 9 (||/|| 0O ,^ ) ) + / CVdl/llocV’o) 


Mn Ml 


n iq 


n 4 

rr 


h 


<^( 11 / 1100 ,^ 0 )- 2 - 
y n z 


This completes the control of $\mathbb{E}_f\big\{\sum_{m=1}^{M_n}\mathbb{E}_f\big((V_{n,m}^{(1)})^{2} \mid I_n\big)\big\}^{2q/2}$.
Control of $\mathbb{E}_f\big\{\sum_{m=1}^{M_n}\mathbb{E}_f\big((V_{n,m}^{(2)})^{2} \mid I_n\big)\big\}^{2q/2}$.

Note that by (A.8) and (A.9), 


%(E%((0%)} 

lm=l J 

— Cq(IIV’olloo) Il/Hoo) /min)Ej 
<^(11^,1100, H/lloc/minjE/ 


(&) Em=l (N n , m (N n , m - l)(iV n , m - 2)) 2 
+ (&) 7^ E!=l (Nn,m(N n ,m ~ l)(N n , m - 2)) 2 

(&) Em=l K, m « m - l)« m - 2)) 2 
+ (&) i Em=l K, m « m - l)«, m - 2)) 2 

(A-18) 


where the last inequality in the above display is again by Theorem 2 of Shao
(2000), where $N'_{n,m}$, $m = 1,\ldots,M_n$, are independent with the same marginals
as $N_{n,m}$, $m = 1,\ldots,M_n$. Proceeding as in the last section, we can apply the
Marcinkiewicz-Zygmund inequality to a centered version of the right hand
side of the above display and obtain that


( M n 

% E E / 


< m=l 



2 q 
2 


< C'qdIV’olloo, 

< C'qdIV’olloo, 

< C'gdIV’olloo, 



2)) 2 


Q 


2q 

Control of $\mathbb{E}_f\big\{\sum_{m=1}^{M_n}\mathbb{E}_f\big((V_{n,m}^{(3)})^{2} \mid I_n\big)\big\}^{2q/2}$.
Note that by (A.12),


E / 


<C 2 (Uol 


>, /min) ((iV n , m - 1 )(N n , m - 2)) 2q ^ 




which is similar to the first term of (A.18) which we controlled earlier.

□ 
Proof of Theorem 5.2. 

Proof. The proof is a simple application of Theorem 4.1 with a slight modification
to account for the fact that the estimator depends on three truncation
points $(k_1,k_2,k_3)$ instead of the single truncation level parametrization
considered in Theorem 4.1. However, as noted in equations (5.1), (5.2), and
Theorem 5.1, the required problem can be equivalently parametrized by the
highest order of truncation i.e. k%. Since, k% decides a corresponding smooth- 

i' 

ness index /3 as the solution of k^, = n 1 + 4 / 3 , this also decides the other two 
levels of truncation. As a result, the proof of Theorem 4.1 goes through in a 
similar fashion. We do not provide the details here for the sake of brevity. □ 

APPENDIX B: PROOF OF LEMMAS 

Proof of Lemma A.l. 


Proof. 


p / (z(j = 1)) < P / \i{k 0l K) - I(k 0 , A;*) I > CVlogn 


y/kl 


n 


since there exists a constant C</, such that /(fco, &*) < C^k 2 ^° -C Now, 
by Hoeffding decomposition, we have 

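$$
\begin{aligned}
\hat I(k_0,k^*) - I(k_0,k^*)
&= U_{n,*} - U_{n,0} - \mathbb{E}_f\left(U_{n,*} - U_{n,0}\right) \\
&= \frac{1}{n(n-1)}\sum_{i \neq j}\Big[K_*(X_i,X_j) - K_0(X_i,X_j) - \mathbb{E}_f\big(K_*(X_i,X_j) - K_0(X_i,X_j)\big)\Big] \\
&= \frac{2}{n}\sum_{i=1}^{n}\Big[\mathbb{E}_{X_j}\big(K_*(X_i,X_j) - K_0(X_i,X_j)\big) - \mathbb{E}_f\big(K_*(X_i,X_j) - K_0(X_i,X_j)\big)\Big] \\
&\quad + \frac{1}{n(n-1)}\sum_{i \neq j}\Big[K_*(X_i,X_j) - K_0(X_i,X_j) - \mathbb{E}_{X_i}\big(K_*(X_i,X_j) - K_0(X_i,X_j)\big) \\
&\qquad\qquad - \mathbb{E}_{X_j}\big(K_*(X_i,X_j) - K_0(X_i,X_j)\big) + \mathbb{E}_f\big(K_*(X_i,X_j) - K_0(X_i,X_j)\big)\Big] \\
&:= T_1 + T_2 .
\end{aligned}
$$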
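The split of a centred second-order U-statistic into a linear part $T_1$ and a degenerate part $T_2$ can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the data follow a known discrete law (so the conditional expectations are exact), and `toy_kernel` is a hypothetical symmetric kernel standing in for $K_* - K_0$. The names `hoeffding_parts`, `toy_kernel`, `support` and `probs` are not from the paper.

```python
from itertools import permutations


def hoeffding_parts(sample, kernel, support, probs):
    """Return (U - E[U], T1, T2) for a symmetric kernel under a known discrete law.

    `support` and `probs` describe the (assumed known) distribution of each X_i,
    so the conditional expectations can be computed exactly.
    """
    n = len(sample)
    # g(x) = E[kernel(x, X')] and theta = E[kernel(X, X')]
    g = {x: sum(p * kernel(x, y) for y, p in zip(support, probs)) for x in support}
    theta = sum(p * g[x] for x, p in zip(support, probs))

    pairs = list(permutations(range(n), 2))  # ordered pairs with i != j
    u_stat = sum(kernel(sample[i], sample[j]) for i, j in pairs) / len(pairs)

    # Linear (first-order) part: (2/n) * sum_i [g(X_i) - theta]
    t1 = 2.0 / n * sum(g[x] - theta for x in sample)
    # Degenerate (second-order) part: average of the fully centred kernel
    t2 = sum(
        kernel(sample[i], sample[j]) - g[sample[i]] - g[sample[j]] + theta
        for i, j in pairs
    ) / len(pairs)
    return u_stat - theta, t1, t2


def toy_kernel(x, y):
    # Symmetric toy kernel standing in for K_* - K_0 (an assumption for the demo)
    return x * y + (1.0 if x == y else 0.0)


centred, t1, t2 = hoeffding_parts([0, 2, 1, 1, 2], toy_kernel, [0, 1, 2], [0.2, 0.5, 0.3])
```

The identity `centred == t1 + t2` holds up to floating-point error for any sample drawn from the stated support, which is the algebraic content of the decomposition; the probabilistic work in the proof lies in bounding the two parts separately.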
Therefore, 


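$$
\begin{aligned}
P_f\left(\big|\hat I(k_0,k^*) - I(k_0,k^*)\big| > C\sqrt{\log n}\,\frac{\sqrt{k^*}}{n}\right)
&= P_f\left(|T_1 + T_2| > C\sqrt{\log n}\,\frac{\sqrt{k^*}}{n}\right) \\
&\le P_f\left(|T_1| > C\delta_n\sqrt{\log n}\,\frac{\sqrt{k^*}}{n}\right)
 + P_f\left(|T_2| > C(1-\delta_n)\sqrt{\log n}\,\frac{\sqrt{k^*}}{n}\right)
\end{aligned}
$$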

for some sequence $\delta_n > 0$ to be specified later.

Control of $P_f\big(|T_1| > C\delta_n\sqrt{\log n}\,\sqrt{k^*}/n\big)$.


|7\| > C5 n y/logn 


n 


2 n /k~ 

- J2 [Ex,- (K* (Xi , Xj ) - K Q (Xi,Xj)) - E f (K*(Xi,Xj) - K^X^Xj))] \ > 


n 


i=i 


Let $R_i = \mathbb{E}_{X_j}\big(K_*(X_i,X_j) - K_0(X_i,X_j)\big)$. Then the above display can be
bounded from above by Markov's inequality as follows,


P, 


< 


o n _ rj— 

-| y j (Ri - Ef(Ri)) I > c5 n "\/log n — — 

i=1 

%E?=il^-%(^)|] 2r 


n 


c5 ny /\ogn^-n 


2 r 


< C r n 2r 


i= 1 

< C r n 2r ^E f [\Ri\ 2r ] 


|2rl 


oWiogn^n 


2 r 


c<5 n Vlogn^n 


2 r 


(B.l) 


where the last two lines follow by applying the Marcinkiewicz-Zygmund
inequality followed by Jensen's inequality. Next, we evaluate $\mathbb{E}_f(|R|^{2r})$.


E f (\R\ 2r ) = E fxi 


2 r 


< 2 


2)—1 


E f X2 (K*(Xi,X 2 ) - K 0 (X 1 ,X 2 ))} 

E f Xl {E fXa (IUXi,X 2 ))} 2r +E fXi {E fx2 {Kq i, X 2 ))} 


2 r 


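Let us evaluate $\mathbb{E}_{f,X_1}\big\{\mathbb{E}_{f,X_2}\big(K_k(X_1,X_2)\big)\big\}^{2r}$ for a general $k$. We first
consider,

$$
\mathbb{E}_{f,X_2}\big(K_k(X_1,X_2)\big)
= \sum_{l=1}^{k}\psi_{k,l}(X_1)\,\mathbb{E}_{f,X_2}\big(\psi_{k,l}(X_2)\big)
= \sum_{l=1}^{k}\psi_{k,l}(X_1)\,\alpha_{k,l}
\le \Big(\sup_{l}\alpha_{k,l}\Big)\sum_{l=1}^{k}\big|\psi_{k,l}(X_1)\big|
$$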
1=1 

<c^\\f\\l 


i 


i =l 


where the last line follows by Lemma C.2. Therefore, for some constant
$C_{\psi_0} > 0$,

®f Xl {%X 2 (Kk(X 1 ,X 2 ))} 2r <C% 

Therefore, from equation (B.1), we have




-| y; (/?., - E/(-Rj)) | > cWlogn^ 

n z —' n 

i=l 


< [|i?| 2r ] 


l 


cdny/logn^n 


2 r 


<<4ll/llL 


n 


r+1 


(c<5 n v / iog^v / ^) 


2r — 


qj/HL 

n 


by choosing $r$ large enough and $\delta_n > 0$ at most a sub-algebraic sequence
going to 0. To see this note that — = J—. 2 ,- • 

(5n\/\ogn\/k*) [SnV lognj n i+4/^i 

Choosing r > 1 _ 2 4l / 3 1 yields the desire result. 


Control of $P_f\big(|T_2| > C(1-\delta_n)\sqrt{\log n}\,\sqrt{k^*}/n\big)$.

Lemma C.4 and Lemma C.5 imply that for some deterministic constant $C_{\psi_0}$ and all $t > 0$,


$$
P_f\left(\frac{1}{n(n-1)}\Big|\sum_{i \neq j}H(X_i,X_j)\Big| > \frac{C_{\psi_0}}{n-1}\Big(\sqrt{k^* t} + t + \sqrt{\frac{k^*}{n}}\,t^{3/2} + \frac{k^*}{n}\,t^{2}\Big)\right) \le 6e^{-t}.
$$

Using $2t^{3/2} \le t + t^2$, we have,

$$
P_f\left(|T_2| > \frac{C_{\psi_0}}{n-1}\Big(\sqrt{k^* t} + t + \sqrt{\frac{k^*}{4n}}\,(t + t^{2}) + \frac{k^*}{n}\,t^{2}\Big)\right) \le 6e^{-t}.
$$


Setting $t = \log n$ we have,

$$
P_f\left(|T_2| > \frac{C_{\psi_0}}{n-1}\Big(\sqrt{k^* \log n} + \log n + \sqrt{\frac{k^*}{4n}}\,\big(\log n + (\log n)^{2}\big) + \frac{k^*}{n}(\log n)^{2}\Big)\right) \le \frac{6}{n}.
$$
Now, since Vk *^ - > max |^, ^^/^(logn + (logn) 2 ), ^^(logn) 2 ! 
and f > ^ for sufficiently large n, we have, 


P 


/ 


|T 2 | > 2(7.00 


y4* log n 


n 


< 


n 


E, 


This is enough to prove the desired result and ends the proof of Lemma 
A.l. □ 
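The step in the proof of Lemma A.1 where the $\sqrt{k^* \log n}$ term is said to dominate the other terms "for sufficiently large n" can be illustrated numerically. The sketch below is only a rough check under stated assumptions: all constants are dropped, `k` stands in for $k^*$, the third term uses $t^{3/2}$ directly rather than the relaxation $(t + t^2)/2$, and the choice $k = n^{3/2}$ is a hypothetical example. The function names are not from the paper.

```python
import math


def threshold_terms(n, k, t):
    """Four terms of the deviation threshold for T2, with constants dropped.

    Here k stands in for the resolution level k* (an illustrative simplification).
    """
    scale = 1.0 / (n - 1)
    return (
        scale * math.sqrt(k * t),
        scale * t,
        scale * math.sqrt(k / n) * t ** 1.5,
        scale * (k / n) * t ** 2,
    )


def leading_term_dominates(n, k):
    """True if the sqrt(k * log n) term is the largest of the four at t = log n."""
    terms = threshold_terms(n, k, math.log(n))
    return terms[0] == max(terms)


small_n = leading_term_dominates(10**4, 10**6)   # k = n^{3/2}, moderate n
large_n = leading_term_dominates(10**8, 10**12)  # k = n^{3/2}, large n
```

With this hypothetical choice of `k`, the $t^2$ term is still the largest at $n = 10^4$, while the leading term takes over by $n = 10^8$, which is why the comparison in the proof is stated only for large $n$.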

Proof of Lemma A.2. 

(£4,1 - <K/)) 2 "] < 2 2q -% [{U nA - Ef (£4,i)) 2 "] + 2 2<?_1 E / [(E/ (£4, 1 - <K/))) 29 

< 2 2<?_1 E / [(C/ n>1 - E, (t/^, 1 )) 29 ] + C^ll/e^ 2 ^ 0 

< 2 2<?_1 Ej [([ 4 , 1 -E y ( 14 , 1 ))^ 






n 


where the second last inequality follows from Lemma C.l. Therefore, it is 
enough to show that 


sup E f \(U ntl -Ef (£/n,!)) 29 
/6H(/3 0 ) L 


<cjj|/||£? 


h' q 

n 2 


(B.2) 


We will give two proofs of inequality B.2. The first will be valid for any 
kernels based on compactly supported wavelet bases. The second will be 
valid only for Haar wavelets. 


First Proof of Inequality (B.2).


E/ 


£4,i — E/ (£4,i) 


2 q 


< c q 


E# 


l E?=1 { E/x.. (Ki(Xi,Xj)) - E f (K 1 (X i , 




2 q 


+E/ 




2 q 


— C q (I\ + If) 


where $H_1(X_i,X_j) = K_1(X_i,X_j) - \mathbb{E}_{f,X_i}\big(K_1(X_i,X_j)\big) - \mathbb{E}_{f,X_j}\big(K_1(X_i,X_j)\big) + \mathbb{E}_f\big(K_1(X_i,X_j)\big)$.
We first bound $I_1$ by the Marcinkiewicz-Zygmund inequality.
To this end, first write $R_i = \mathbb{E}_{f,X_j}\big(K_1(X_i,X_j)\big)$. Therefore, by the Marcinkiewicz-Zygmund
inequality and Jensen's inequality, we have,


h = 2 2 g E/ 


1 


2 q 


-Ew-w) 


i=l 


1 _ n _ 

[(-Ri — Ef(Ri)) 2q ^ 

i= 1 

< 2C' g n _g Ej(|Ri| 2|? ) 

Therefore, it is enough to control $\mathbb{E}_f(|R_i|^{2q}) = \mathbb{E}_{f,X_i}\big[\big|\mathbb{E}_{f,X_j}\{K_1(X_i,X_j)\}\big|^{2q}\big]$.


EjiK^Xj)} | =| Y,i’iA x i) E fx 1 (’hA x ~i)) I 

Z =1 

=1 ^2oii >k ipi' k (Xi) |< sup(|a^ fc |)^ \ipi,k(Xi)\ 

i=i i=i 

< ^ 11 / 1100 ^^ 11 / 1100 ^ = Cl "'" 2 


Wki 


J il>o IIj I loo 


by Lemma C.2. Therefore, 


^ f m 2q ) <ci 0 \\f wi, 

as well. This in turn implies 

h < ^C q n~ q E,f(\Ri\ 2q ) < 2C q C 2 0 \\f\\ln- q < 2C q C 2 Q \\f\\l 
as required. Therefore, it is enough to show that 

^<cj 0 ii/iig?(h)’ 

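i.e. we need to control $\mathbb{E}_f\big|\frac{1}{n(n-1)}\sum_{i \neq j}H_1(X_i,X_j)\big|^{2q}$, where $H_1(X_i,X_j) =
K_1(X_i,X_j) - \mathbb{E}_{f,X_i}\big(K_1(X_i,X_j)\big) - \mathbb{E}_{f,X_j}\big(K_1(X_i,X_j)\big) + \mathbb{E}_f\big(K_1(X_i,X_j)\big)$.
Following the arguments of Lemma C.4 and Lemma C.5, we can also show that
for all $t > 0$,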


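$$
P_f\left(\frac{1}{n(n-1)}\Big|\sum_{i \neq j}H_1(X_i,X_j)\Big| > \frac{C_{\psi_0}}{n-1}\Big(\sqrt{k_1 t} + t + \sqrt{\frac{k_1}{n}}\,t^{3/2} + \frac{k_1}{n}\,t^{2}\Big)\right) \le 6e^{-t}.
$$

Using $2t^{3/2} \le t + t^2$, we have,

$$
P_f\left(\frac{1}{n(n-1)}\Big|\sum_{i \neq j}H_1(X_i,X_j)\Big| > \frac{C_{\psi_0}}{n-1}\Big(\sqrt{k_1 t} + \Big(1 + \sqrt{\frac{k_1}{4n}}\Big)t + \Big(\sqrt{\frac{k_1}{4n}} + \frac{k_1}{n}\Big)t^{2}\Big)\right) \le 6e^{-t}.
$$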


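Calling $\frac{1}{n(n-1)}\sum_{i \neq j}H_1(X_i,X_j) = Z$, we have an exponential inequality of
the form,

$$
P_f\left(|Z| > a_1\sqrt{t} + a_2 t + a_3 t^{2}\right) \le 6e^{-t},
$$

with $a_1 = \frac{C_{\psi_0}}{n-1}\sqrt{k_1}$, $a_2 = \frac{C_{\psi_0}}{n-1}\Big(1 + \sqrt{\frac{k_1}{4n}}\Big)$, $a_3 = \frac{C_{\psi_0}}{n-1}\Big(\sqrt{\frac{k_1}{4n}} + \frac{k_1}{n}\Big)$. From
this we need to estimate,

$$
\mathbb{E}_f(|Z|^{2q}) = 2q\int_0^{\infty} x^{2q-1}\,P_f(|Z| > x)\,dx.
$$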

We do this as follows. Suppose $t(x)$ solves $a_1\sqrt{t(x)} + a_2 t(x) + a_3 t^{2}(x) = x$.
Now, trivially $a_1\sqrt{t} + a_2 t + a_3 t^{2}$ is an increasing function of $t \in \mathbb{R}_+$. Therefore,
if we can find a function $h(x) > 0$ such that $a_1\sqrt{h(x)} + a_2 h(x) + a_3 h^{2}(x) \le x$
for all $x > 0$, then $h(x) \le t(x)$ for all $x > 0$. It is not too difficult to
see that one such $h(x)$ is given by $h(x) = b_1 x^{2} \wedge b_2 x \wedge b_3\sqrt{x}$ where
b\ = 62 = 63 = for fixed constants ci, 02,03 > 0. Therefore, 

poo 

E f (\Z\ 2q ) = 2q x 2q ~ l E f {\Z\ > x)dx 


C,,, 


< 6e 


ca 


—t 


< 2 q 


< 12 q 
= 12 q 

< 12 q 

= 12 q 


poo 

/ x 2 q ~ 1 ¥f(\Z\ > ai\/ h(x) + d2h(x) + a^,h 2 (x))dx 

Jo 

poo 

Jo 

poo 

Jo 

\f 


2q-l e -h(x) dx 

,2q-l e ~{bix 2 /\b 2 X/\b 3 V^} d , 


X 


2 q- 1 e - bix2 dx + J x 2 i-'e-^dx + I ~ x^-'e-^da 


'm . r(2 q) 2T(4g)\ „ (h 

2 b\ + bf bf ) - *> U 2 


3 

by our choices of $b_1, b_2, b_3$. This completes the first proof.
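The construction of a lower envelope $h(x) \le t(x)$ used in the first proof can be sketched in a few lines. The code below is a minimal illustration, assuming one admissible choice of constants, $b_1 = 1/(9a_1^2)$, $b_2 = 1/(3a_2)$, $b_3 = 1/\sqrt{3a_3}$, under which each of the three terms of the threshold is at most $x/3$. These particular constants are an assumption made here for the demonstration; the paper's constants $c_1, c_2, c_3$ are not recoverable from the text. The names `h` and `threshold` are hypothetical helpers.

```python
import math


def threshold(t, a1, a2, a3):
    """Deviation level a1*sqrt(t) + a2*t + a3*t^2, increasing in t >= 0."""
    return a1 * math.sqrt(t) + a2 * t + a3 * t ** 2


def h(x, a1, a2, a3):
    """Minimum of three simple functions of x, chosen so that threshold(h(x)) <= x.

    Illustrative constants: each term of the threshold is bounded by x / 3.
    """
    b1 = 1.0 / (9.0 * a1 ** 2)       # controls a1 * sqrt(h) <= x / 3
    b2 = 1.0 / (3.0 * a2)            # controls a2 * h      <= x / 3
    b3 = 1.0 / math.sqrt(3.0 * a3)   # controls a3 * h^2    <= x / 3
    return min(b1 * x ** 2, b2 * x, b3 * math.sqrt(x))


# Example evaluation with arbitrary positive coefficients.
example_gap = 2.0 - threshold(h(2.0, 0.5, 2.0, 3.0), 0.5, 2.0, 3.0)
```

Because the threshold is increasing in $t$, the inequality `threshold(h(x)) <= x` gives $h(x) \le t(x)$, so the tail probability at level $x$ is at most $6e^{-h(x)}$, and the moment integral splits into three elementary Gamma-type integrals.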
Proof of Lemma A. 3 . Pick any l < l c (C*). Then 

p fO = 0 < < C 2 pt log nR(kj)) 
= P / {l(k *, kj) - I(kt,k*) < C opt ^l^^R(k*) ~ m,k \*)) 

- F f ( i(klk *) - I(kt,k*) < — Copt ^/\ogn^R(kj) - I(kf, k*)) 

(B.3) 

Recall that by l < l c (C*). we have that I 2 (ki,k*) > C* log nR( k*). We 
consider two cases. 

Case 1: I(k*,k?) > ^C* log nR(kt). 

In this case we have that the right hand side of (B.3) is bounded by 

F f (j(k*,kj) - I{k *, kj) < C opt ^7^ R(k*) - ^Jc* lognR(k*)) 

+ Ff (/(*?, kj) - I(k;,k*) < -C op t^/]ogn^JR(k*) - JC * log nR{k*]) 

^ Ci/jo 

— 5 

n 

where the last line follows from the proof of Lemma A.1 along the lines of
Lemma C.4 and C.5, provided $C_{opt}$ is chosen according to the constant specified in
Theorem 3.1 and $C^* > 4C_{opt}^2$.

Case 2: I(k*,kt) < -yjc* lognRfkj j. 

In this case we have that the right hand side of (B.3) equals, 

1 - F / (l(kf, k*) - I(kf,k*) > C op t-^/\ogn^j R(k*) - yj C* log nR{k*)) 

- 1 + F/ [l{k *, k*) - I(kt,k*j) > ~C op t^/\ogn^JR(k*) - ^C* \ognR{k*)) 

— ? 
n 

where the last line once again follows from the proof of Lemma A.1 along
the lines of Lemma C.4 and C.5, provided $C_{opt}$ is chosen according to the constant
specified in Theorem 3.1 and $C^* > 4C_{opt}^2$.

Proof of Lemma A.4. 

Proof. 

p fO >j + 2)<F f {31 > j : i 2 (k* j+l , kt) > C 2 opt log ni2(fc?)) 

N—l 

< ]T Ff {l 2 (k* +1 , k*) > C 2 pt lognR(k*)^j 
1 = 3 +2 
Now, for any l > j + 2. 

P/ [l 2 (k* +1 ,k*)>C 2 opt \ognR(k*)) 

= P f (l(k* j+1 ,kf) - I(k] +1 ,kf) > C opt V - I(k* j+ 1 , fcf)) 

+p f (/(^ +1 , fcf) - j(*; + 1 , fcf) < -c opt vr^^R(kf) - i(k* j+1 , kfj). 


Thereafter, note that 


i 2 (k* +1 ,kf) 

wn 


— 



, which has a power of n 


equal to i^p t ~ i+ 4 p. +1 — 0 since (3i < (ij+\ < /?/. Hence arguing as 
in the proof of Lemma A.1 along the lines of Lemma C.4 and C.5, we have
the desired result provided $C_{opt}$ is chosen according to the constant specified in
Theorem 3.1. □


Proof of Lemma A.5. 


Proof. We begin by noting that it is enough to control 


$$
\int_{\mathcal{X}_{n,m}}\int_{\mathcal{X}_{n,m}}\int_{\mathcal{X}_{n,m}}\int
K_{k_1}(x, x_1)K_{k_2}(x, x_2)K_{k_3}(x, x_3)\,f(x_1)f(x_2)f(x_3)\,dx\,dx_1\,dx_2\,dx_3
$$

for arbitrary $M_n < k_1 < k_2 < k_3$. We assume that $k_1$ divides $k_2$, $k_2$ divides $k_3$
and $M_n$ divides $k_1$. The proof for a general tuple follows by similar arguments
with suitable obvious modifications. The above integral equals,

/ / /eee{ 


li h h v 


(®) 

^(x 2 )^(x)^(x 3 ) 


~[f(xi)dxi 


where $\psi_l^k(x)$ is the $k$-dilated and $l$-shifted wavelet basis. Taking the integral inside
the summation, any of the summand equals at most C(ip o, \\f\\oo)j^kik 2 k 3 klk2ka 

— 11 . The reason being, each of the summand corresponds to an inte¬ 
gral of x 3 ,x 2 ,x 3 over intervals of length and ^ respectively. More¬ 

over, since k 3 is the finest refinement, the integral over x is simply over an 
interval of length . The bound on the summand then follows. Therefore, 
we now need to count the number of summands that contribute to the sum 
above. This can be argued as follows. For every given subinterval Xn,m there 
are less than C^ 0 j^- subintervals of length with support intersecting 
that of some ip^. For each given subinterval of length -A 2 - with support in¬ 
tersecting that of some V’f 1 1 , there are less than C^, 0 subintervals of length 
C h 

with support intersecting that of some ^ 2 2 . Finally, for each given subin- 

terval of length with support intersecting that of some ^ 2 2 , there are 

less than C^ 0 subintervals of length with support intersecting that of 

some 'i/^ 3 . Therefore, total number of terms contributing to the sum above 

is at most C^o = This implies that in absolute value, each 

a n ,m is at most C(ijj 0 , ll/IU)^]^ = WfWoo) jfc, as promised. □ 

Proof of Lemma A.6. 


Proof. We invoke Proposition 2.4 of Giné, Latała and Zinn (2000) (page
9), which implies that for a constant $C_q$ depending only on $q$


E 


1 


m (m — 1 ) (to — 2 ) 


£ K(X h ,X i2 ,X i3 ] 


* 1^*2 ^*3 


(m (m — 1) (m — 2)) q C q < 


m 


3q/2£j 


+m q E 


max. 


E(K 2 (X h ,X i2 ,X i3 )\X i3 )\ g/2 
E(K 2 (X h ,X i2 ,X i3 )\X i3 )\ q/2 


+m q / 2 E [maxj 3> i 2 |E [K 2 (X h ,X i2 ,X i3 ) \X i3 , X l2 )\ q/2 
+E [maxi 3ii2jil |(K (X h , X i2 , Xj 3 ))| 9 ] 

(B.4) 


Now, for $Z_1, \ldots, Z_m \ge 0$ identically distributed and possibly dependent,
we have $\mathbb{E}(Z_{(m)}) \le m\,\mathbb{E}(Z_1) = \mathbb{E}\big(\sum_{i=1}^{m} Z_i\big)$, where $Z_{(m)} = \max_{i \in \{1,\ldots,m\}} Z_i$.
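The bound on the expected maximum rests on the pointwise fact that the maximum of nonnegative numbers never exceeds their sum, so no independence is needed. The following is a small Monte Carlo sketch of that fact; the dependence structure (a shared multiplicative factor) and the name `max_and_sum_means` are assumptions chosen purely for illustration.

```python
import random


def max_and_sum_means(m, reps, seed=0):
    """Monte Carlo averages of max_i Z_i and sum_i Z_i for dependent, nonnegative Z_i.

    The Z_i share a common random factor, so they are identically distributed
    but not independent (an illustrative construction).
    """
    rng = random.Random(seed)
    max_total = 0.0
    sum_total = 0.0
    for _ in range(reps):
        shared = rng.random()  # common component inducing dependence
        z = [shared * rng.random() for _ in range(m)]
        max_total += max(z)
        sum_total += sum(z)
    return max_total / reps, sum_total / reps


mean_max, mean_sum = max_and_sum_means(m=5, reps=1000)
```

Since the inequality holds for every draw, the averaged maximum is bounded by the averaged sum for any seed, which mirrors how each maximum in (B.4) is replaced by $m$ times a single moment.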


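Therefore, (B.4) is bounded above by

$$
\begin{aligned}
(m(m-1)(m-2))^{-q}\,C_q\Big\{
& m^{3q/2}\,\mathbb{E}\big[|K(X_{i_1},X_{i_2},X_{i_3})|^{q}\big]
 + m^{q+1}\,\mathbb{E}\big[|K(X_{i_1},X_{i_2},X_{i_3})|^{q}\big] \\
& + m^{q/2+2}\,\mathbb{E}\big[|K(X_{i_1},X_{i_2},X_{i_3})|^{q}\big]
 + m^{3}\,\mathbb{E}\big[|K(X_{i_1},X_{i_2},X_{i_3})|^{q}\big]\Big\}
\le C_q\, m^{-q}\,\mathbb{E}\big[|K(X_{i_1},X_{i_2},X_{i_3})|^{q}\big].
\end{aligned}
$$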


□ 


APPENDIX C: TECHNICAL LEMMAS 
The following lemma regarding the bias of U.% will be used regularly. 
Lemma C.1. Suppose $f \in H(\beta)$. Then for compactly supported wavelet
bases 

sup | E f {U*) - <j>{f) \< C^WfW^k-Wf. 
f 

Proof. The proof involves simple algebra and properties of compactly 
supported wavelet bases (Hardle et al., 1998) and hence is omitted. 

We will be using the following lemma about properties of compactly sup¬ 
ported wavelets. 

Lemma C.2. For kernels based on compactly supported wavelets, as de¬ 
fined by Equation (2.1), the following hold. 

1. $\|f_k\|_\infty \le C_{\psi_0}\|f\|_\infty$.

2 . sup, K,| < ^4^. 

3 • su Px Eli \^k,i(x)\ < C^.k 

Proof. Once again, the proofs follow from simple algebra and properties 
of compactly supported wavelet bases (Hardle et al., 1998). □ 



We will also need the following asymptotic normality of the quadratic 
estimators (Robins et al., 2015). 


Lemma C.3. 


For f € H(j3, c ) we have 


u* w -E f (t/£ (/3) ) 
Jvar f (U^ P) ) 


=> N(0, 1) where 


Var f 

We will need the following lemma about tails of second order U-statistics. 
Lemma C.4. Let, 


$$
\begin{aligned}
H(X, Y) ={}& K_*(X, Y) - K_0(X, Y) - \mathbb{E}_Y\big(K_*(X, Y) - K_0(X, Y)\big) \\
& - \mathbb{E}_X\big(K_*(X, Y) - K_0(X, Y)\big) + \mathbb{E}_f\big(K_*(X, Y) - K_0(X, Y)\big).
\end{aligned}
$$

Then the following exponential inequality holds for every $t > 0$ and a
deterministic constant $C > 0$:

$$
P_f\left(\Big|\sum_{i \neq j}H(X_i, X_j)\Big| > C\big(A_1\sqrt{t} + A_2 t + A_3 t^{3/2} + A_4 t^{2}\big)\right) \le 5.6\,e^{-t},
$$

where

$$
A_1^2 = n(n-1)\,\mathbb{E}_f\big[H^2(X_1,X_2)\big],
$$


A 2 = sup 


1% [jl = iE%\H(X ll X 2 )a l (X 1 )b j (X 2 )\ | : 

% (EILXTO) < 1 ,% (Z]=ib 2 j(X 2 )) < 1 


$$
A_3^2 = n\,\sup_{x}\big\{\mathbb{E}_{f,X_2}\big(H^2(x,X_2)\big)\big\},
\qquad
A_4 = \sup_{x,y}|H(x, y)|.
$$

Proof. This follows directly from Theorem 3.4 of Houdre and Reynaud-Bouret 
(2003). □ 


We now estimate the terms $A_1, A_2, A_3, A_4$ in the following lemma.


Lemma C.5. For kernels based on compactly supported wavelet bases,
there exists a deterministic constant $C_{\psi_0}$ (depending on $\psi_0$ and $\|f\|_\infty$) such
that,

$$
A_1^2 \le C_{\psi_0}\,n(n-1)(k^* + k_0), \quad A_2 \le C_{\psi_0}\,n, \quad A_3^2 \le C_{\psi_0}\,n(k^* + k_0), \quad A_4 \le C_{\psi_0}(k^* + k_0).
$$


Proof. The proof follows along the same line of arguments as those laid 
down in Proposition 2 of Bull and Nickl (2013). □ 